Open leon-g-xu opened 1 month ago
One solution is to reset start_index to 0 when the next epoch starts. I am not sure whether I missed a setting that already does this.
@epwalsh I believe you already fixed this. Can you confirm?
If this is already fixed, can you share the commit/PR that fixes this?
🐛 Describe the bug
When a training job resumes from a checkpoint, it picks up at the epoch and start_index saved in that checkpoint. The start_index is set on the data loader, but it is not reset to 0 when the current epoch finishes and the next one starts. As a result, the new epoch still reads data from the old start_index instead of from the beginning.
start_index loaded from the checkpoint: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L377
How start_index is used in the data loader (it is never reset): https://github.com/allenai/OLMo/blob/main/olmo/data/iterable_dataset.py#L133-L135
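
To make the behavior concrete, here is a minimal sketch using a toy dataset rather than the real OLMo classes. `ToyIterableDataset` and `run_epochs` are hypothetical names made up for illustration; the only assumption carried over from the linked code is that iteration skips everything before `start_index`. It compares two epochs after a resume, with and without resetting `start_index` at the epoch boundary.

```python
from typing import Iterator, List


class ToyIterableDataset:
    """Hypothetical stand-in for a resumable dataset; not OLMo's actual class."""

    def __init__(self, data: List[int], start_index: int = 0) -> None:
        self.data = data
        self.start_index = start_index

    def __iter__(self) -> Iterator[int]:
        # Skip the items that were consumed before the checkpoint was saved.
        return iter(self.data[self.start_index:])


def run_epochs(
    dataset: ToyIterableDataset, num_epochs: int, reset_between_epochs: bool
) -> List[List[int]]:
    """Collect the items seen in each epoch."""
    seen = []
    for _ in range(num_epochs):
        seen.append(list(dataset))
        if reset_between_epochs:
            # Proposed fix: later epochs should start from the beginning.
            dataset.start_index = 0
    return seen


data = list(range(6))

# Resume mid-epoch at index 4, then run the rest of that epoch plus one more.
buggy = run_epochs(ToyIterableDataset(data, start_index=4), 2, reset_between_epochs=False)
fixed = run_epochs(ToyIterableDataset(data, start_index=4), 2, reset_between_epochs=True)

print(buggy)  # [[4, 5], [4, 5]]
print(fixed)  # [[4, 5], [0, 1, 2, 3, 4, 5]]
```

Without the reset, the second epoch only sees the tail of the data again. Where the reset should actually live in OLMo (trainer or dataset) is left open here.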
Versions
olmo 0.3.0