facebookresearch / MobileLLM

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In ICML 2024.

The calculation of the number of epochs is incorrect #6

Closed: westn closed this issue 2 months ago

westn commented 2 months ago

First, thanks for releasing the code and for being as open as possible with it.

I noticed that the epoch logging seems to be based on sys.maxsize rather than the actual size of the current dataset. Is this low-hanging fruit to fix?

https://github.com/facebookresearch/MobileLLM/blob/1e67bf6a831b0d3863217f6a1dfd187919f636ac/pretrain.py#L91-L93

This also affects the behavior when you pass --num_train_epochs 1 to the script.
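
For illustration, here is a minimal, hypothetical sketch of the symptom (the variable names and values are mine, not the code in the linked lines of pretrain.py): when the dataset reports sys.maxsize as its length, any epoch count derived from it stays effectively at zero.

```python
import sys

global_step = 5_000                       # hypothetical training progress
batch_size = 32                           # hypothetical batch size
dataset_len = sys.maxsize                 # stand-in length for the streamed dataset
steps_per_epoch = dataset_len // batch_size
print(global_step / steps_per_epoch)      # ~1.7e-14: the logged epoch never advances
```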

zxdmike commented 2 months ago

This is by design. The dataloader doesn't preload all the data, since the data file could be huge; it streams data continuously, so the total length of an epoch is unknown. Instead of using --num_train_epochs to control the training duration, we use --max_steps.
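
For readers unfamiliar with this pattern, here is a minimal, hypothetical PyTorch sketch (StreamingTextDataset, the file name, and the step count are illustrative, not MobileLLM's actual code) of a streamed dataset with no known length, where training is bounded by a step count rather than by epochs:

```python
import itertools
from torch.utils.data import DataLoader, IterableDataset

class StreamingTextDataset(IterableDataset):
    """Streams examples from a file that may be too large to preload.

    No __len__ is defined, so neither the dataloader nor the trainer
    can know how many steps make up one epoch.
    """

    def __init__(self, path, seq_len=128):
        self.path = path
        self.seq_len = seq_len

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                tokens = line.split()  # stand-in for real tokenization
                for i in range(0, max(len(tokens) - self.seq_len, 0), self.seq_len):
                    yield tokens[i : i + self.seq_len]

max_steps = 10_000  # training duration is bounded in steps, not epochs
loader = DataLoader(StreamingTextDataset("train.txt"), batch_size=None)
for step, batch in enumerate(itertools.islice(loader, max_steps)):
    pass  # forward / backward / optimizer step would go here
```

Because the iterator simply stops after max_steps, the notion of "number of epochs" never has to be computed at all.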