CERC-AAI / multimodal

An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
Apache License 2.0

Add data checkpoint within epoch feature #17

Closed floatingbigcat closed 1 year ago

floatingbigcat commented 1 year ago

Description

When our dataset is very large, we want the model to walk through it without replacement, possibly for only one or a few epochs.
We can't finish the training in a single run because of the wall-time limit on each job, so we need to add support for resuming the dataloader from a given iteration within an epoch.

Solution

open_clip has a solution that slices all shards into many subsets. In each "sub_epoch" the dataloader walks through one subset. We can record the sub_epoch number and use it when training restarts to restore the data position. https://github.com/mlfoundations/open_clip/pull/535
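
A minimal sketch of the sub-epoch idea, to make the proposal concrete. This is not the code from the open_clip PR or from this repo. The function names (`shards_for_sub_epoch`, `save_sub_epoch`, `load_sub_epoch`), the JSON state file, and the strided slicing are all assumptions made for illustration.

```python
import json
import random
from pathlib import Path


def shards_for_sub_epoch(shards, sub_epoch, num_sub_epochs, seed=0):
    """Return the shards assigned to one sub-epoch (hypothetical helper).

    One full epoch is split into `num_sub_epochs` disjoint subsets, so
    walking sub-epochs 0..num_sub_epochs-1 visits every shard exactly once.
    """
    full_epoch, slot = divmod(sub_epoch, num_sub_epochs)
    order = list(shards)
    # Seed with the full-epoch index so that every sub-epoch belonging to
    # the same epoch sees the same permutation, even after a job restart.
    random.Random(seed + full_epoch).shuffle(order)
    # Strided slice: the slots are disjoint and together cover all shards.
    return order[slot::num_sub_epochs]


def save_sub_epoch(path, sub_epoch):
    """Record the next sub-epoch to run (assumed JSON state file)."""
    Path(path).write_text(json.dumps({"sub_epoch": sub_epoch}))


def load_sub_epoch(path):
    """Return the recorded sub-epoch, or 0 if no state file exists yet."""
    p = Path(path)
    if not p.exists():
        return 0
    return json.loads(p.read_text())["sub_epoch"]


if __name__ == "__main__":
    shards = [f"shard-{i:05d}.tar" for i in range(10)]  # placeholder names
    state_file = "data_state.json"  # placeholder path
    start = load_sub_epoch(state_file)
    for sub_epoch in range(start, 4):
        subset = shards_for_sub_epoch(shards, sub_epoch, num_sub_epochs=4)
        # ... build the dataloader over `subset` and train on it here ...
        save_sub_epoch(state_file, sub_epoch + 1)
```

With this layout a job that hits its time limit loses at most the sub-epoch that was in progress, so a larger `num_sub_epochs` gives finer-grained recovery at the cost of more frequent dataloader rebuilds.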

floatingbigcat commented 1 year ago

https://github.com/AGI-Collective/multimodal/pull/29