allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

start_index not getting reset in data loader when moving to new epoch #650

Open leon-g-xu opened 1 month ago

leon-g-xu commented 1 month ago

🐛 Describe the bug

When a training job resumes from a checkpoint, it picks up from the epoch and start_index saved in that checkpoint, and start_index is set on the data loader. However, start_index is not reset to 0 when the current epoch finishes and the next one starts, so the new epoch still reads data from the old start_index.

Where start_index is loaded from the checkpoint: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L377

How start_index is used in the data loader (it is never reset): https://github.com/allenai/OLMo/blob/main/olmo/data/iterable_dataset.py#L133-L135

Versions

olmo 0.3.0

leon-g-xu commented 1 month ago

One solution is to reset start_index to 0 when the next epoch begins. I am not sure whether there is a setting that I missed.
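
To make the idea concrete, here is a minimal sketch of the behavior and of the proposed reset. It uses a toy stand-in, not OLMo's actual classes: `ToyIterableDataset`, `run_epochs`, and the `reset_on_new_epoch` flag are all hypothetical names made up for illustration, and the real data loader may well be structured differently.

```python
from typing import Iterator, List


class ToyIterableDataset:
    """Hypothetical stand-in for a resumable dataset (not OLMo's real class)."""

    def __init__(self, data: List[int], start_index: int = 0) -> None:
        self.data = data
        # Offset restored from a checkpoint; examples before it are skipped.
        self.start_index = start_index

    def __iter__(self) -> Iterator[int]:
        # Every pass over the dataset starts at start_index.
        return iter(self.data[self.start_index:])


def run_epochs(
    dataset: ToyIterableDataset, num_epochs: int, reset_on_new_epoch: bool
) -> List[List[int]]:
    """Return the examples seen in each epoch (hypothetical training loop)."""
    seen_per_epoch = []
    for _ in range(num_epochs):
        seen_per_epoch.append(list(dataset))
        if reset_on_new_epoch:
            # Proposed fix: the offset only applies to the resumed epoch.
            dataset.start_index = 0
    return seen_per_epoch


data = list(range(5))

# Without the reset, the second epoch repeats the truncated range.
print(run_epochs(ToyIterableDataset(data, start_index=3), 2, reset_on_new_epoch=False))
# -> [[3, 4], [3, 4]]

# With the reset, the second epoch covers the full dataset again.
print(run_epochs(ToyIterableDataset(data, start_index=3), 2, reset_on_new_epoch=True))
# -> [[3, 4], [0, 1, 2, 3, 4]]
```

In this toy version the reset happens in the training loop after each epoch; it could equally be done inside the dataset when it detects an epoch change.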

AkshitaB commented 3 weeks ago

@epwalsh I believe you already fixed this. Can you confirm?

leon-g-xu commented 3 weeks ago

If this is already fixed, can you share the commit/PR that fixes this?

epwalsh commented 3 weeks ago

Yeup, fixed here: https://github.com/allenai/OLMo/commit/a3e2ea7b598f1342990045c33fed5027a6b56611