Description
When the dataset is very large, we want the model to walk through it without replacement, possibly for only one or a few epochs. We can't do the training in one shot because each job hits a wall-clock time limit, so we need to add support for the dataloader to resume from a given iteration within an epoch.
Solution
open_clip offers a solution: slice all shards into many subsets, and for each "sub-epoch" walk through one subset. We record the sub-epoch number and use it when restarting training as the data checkpoint.
https://github.com/mlfoundations/open_clip/pull/535
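
A minimal sketch of the shard-slicing idea, assuming a webdataset-style list of shard files. The helper name `shards_for_sub_epoch` and the parameter `num_sub_epochs` are illustrative, not open_clip's actual API:

```python
def shards_for_sub_epoch(all_shards, sub_epoch, num_sub_epochs):
    """Return the disjoint subset of shards for one sub-epoch.

    Walking one subset per job gives sampling without replacement at the
    shard level across the whole logical epoch.
    """
    if not 0 <= sub_epoch < num_sub_epochs:
        raise ValueError("sub_epoch must be in [0, num_sub_epochs)")
    # Strided slicing spreads shards evenly across sub-epochs.
    return all_shards[sub_epoch::num_sub_epochs]


# Example: a 10-sub-epoch split of 1000 hypothetical shards.
shards = [f"data-{i:05d}.tar" for i in range(1000)]
subset = shards_for_sub_epoch(shards, sub_epoch=3, num_sub_epochs=10)
# Build this job's dataloader over `subset` only.
```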
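
Recording the sub-epoch number could look like the following hypothetical checkpoint plumbing, which persists the counter next to the model state so a restarted job resumes from the next data subset; `save_ckpt` and `load_ckpt` are assumed helpers, not part of open_clip:

```python
import torch


def save_ckpt(path, model, optimizer, sub_epoch):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # Data position at sub-epoch granularity.
            "sub_epoch": sub_epoch,
        },
        path,
    )


def load_ckpt(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    # Resume the data walk from the recorded sub-epoch.
    return ckpt.get("sub_epoch", 0)
```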