xiaosuyu1997 opened 7 months ago
It seems like StatefulDataLoader from torchdata might help here. However, if I replace my old data loader with StatefulDataLoader, I cannot find a corresponding entry in the saved checkpoint. The warning doesn't appear, either.
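For reference, a minimal standalone sketch of how StatefulDataLoader is meant to capture mid-epoch progress (assuming a recent torchdata release that ships `torchdata.stateful_dataloader`); the open question remains how this snapshot gets persisted into and restored from the Lightning checkpoint:

```python
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(10))
loader = StatefulDataLoader(dataset, batch_size=2, num_workers=0)

it = iter(loader)
print(next(it))                  # consume the first batch
snapshot = loader.state_dict()   # captures the progress of the in-flight iterator

# A fresh loader restored from the snapshot continues from the second batch.
resumed = StatefulDataLoader(dataset, batch_size=2, num_workers=0)
resumed.load_state_dict(snapshot)
for batch in resumed:
    print(batch)
```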
I am experiencing the same problem when resuming training on a very large dataset where a single epoch takes a long time. I would also support adding batch-skipping logic like the one in the Hugging Face training script.
Description & Motivation
LLMs are trained on ever-growing corpora, so resuming only at epoch boundaries is not enough: models may be trained for just a few epochs, and a single epoch can take days. Currently, Lightning prints a warning when trying to resume from a mid-epoch step and asks for a resumable dataloader.
However, I can't find any examples of resuming from mid-epoch steps in the docs or blog posts (maybe I missed them). It also seems strange to implement a dataloader with state_dict/load_state_dict methods: a dataloader cannot hold state by design; it is the iterator derived from the dataloader that is resumable and should hold the necessary state. Besides, we may not need state_dict and load_state_dict methods to save/load dataloaders at all, since the epoch/step indices carry enough information to restore the training batch position.
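To make the interface concrete, here is a minimal sketch (my own illustration, not an existing Lightning utility) of a DataLoader exposing the state_dict/load_state_dict pair the warning asks for, simply by counting yielded batches and fast-forwarding on resume. A real implementation would also need to restore sampler and RNG state so the same batches come back in the same order:

```python
from torch.utils.data import DataLoader

class ResumableDataLoader(DataLoader):
    """Hypothetical sketch: the loader itself exposes state_dict /
    load_state_dict by counting the batches it has yielded."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._batches_yielded = 0   # progress within the current epoch
        self._batches_to_skip = 0   # set by load_state_dict on resume

    def __iter__(self):
        base_iter = super().__iter__()
        # On resume, fast-forward past batches consumed before the checkpoint
        # was saved; on a fresh epoch this is a no-op. Assumes a deterministic
        # sampling order between the original run and the resumed run.
        self._batches_yielded = self._batches_to_skip
        for _ in range(self._batches_to_skip):
            next(base_iter, None)
        self._batches_to_skip = 0
        for batch in base_iter:
            self._batches_yielded += 1
            yield batch

    def state_dict(self):
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state):
        self._batches_to_skip = state["batches_yielded"]
```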
I propose a possible hack that can work around this issue, taking inspiration from the Hugging Face training script.
Pitch
No response
Alternatives
Here is the ugly hack (via callbacks in the LightningModule) that I currently use to resume from the specific batch:
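A rough sketch of this kind of workaround (the wrapper name and wiring are illustrative, assuming the number of already-consumed batches can be recovered from the checkpoint's global step):

```python
import itertools
from torch.utils.data import DataLoader

class SkipFirstBatches:
    """Illustrative wrapper: on its first iteration it silently consumes
    `num_skip` batches so training resumes at the right position; later
    epochs iterate normally."""

    def __init__(self, dataloader: DataLoader, num_skip: int = 0):
        self.dataloader = dataloader
        self.num_skip = num_skip

    def __len__(self):
        return len(self.dataloader)

    def __iter__(self):
        it = iter(self.dataloader)
        # Fast-forward past batches already trained on before the checkpoint
        # was written. Data is read and discarded, so this can be slow for
        # very large skips, but it keeps the sampler order intact.
        for _ in itertools.islice(it, self.num_skip):
            pass
        self.num_skip = 0
        return it
```

Inside the LightningModule, one could return this wrapper from `train_dataloader()` and compute `num_skip` from `self.trainer.global_step` modulo the number of batches per epoch (again, an assumption about how the resume offset is obtained, not part of any official API).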
Additional context
No response
cc @borda