allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo

DDP training tries to save sharded checkpoint on the last step #664

Closed ananyahjha93 closed 1 month ago

ananyahjha93 commented 1 month ago

🐛 Describe the bug

2024-07-17T04:30:51.202069810Z 2024-07-16 21:30:51.201  jupiter-cs-aus-121.reviz.ai2.in:0   olmo.train:1268 INFO    Saving final checkpoint...
2024-07-17T04:30:52.220928528Z 2024-07-16 21:30:52.219  jupiter-cs-aus-121.reviz.ai2.in:5   olmo.util:163   CRITICAL    Uncaught AssertionError: TorchLegacyShardedCheckpointer is being called to save a model where `distributed_strategy` is not FSDP.

With DDP, when the last step count is divisible by save_interval_unsharded, the unsharded checkpoint has already been saved at that step. The final-checkpoint logic therefore falls through to the sharded checkpoint saver, TorchLegacyShardedCheckpointer, because of the following if condition: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L1256. Since distributed_strategy is DDP rather than FSDP, the sharded checkpointer hits the assertion above (see the sketch below).
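
A minimal sketch of that fall-through, assuming simplified names (save_unsharded_checkpoint, save_sharded_checkpoint, and the cfg/trainer fields are illustrative, not the exact olmo/train.py code):

```python
# Illustrative sketch of the final-checkpoint selection around
# olmo/train.py#L1256; names are hypothetical, not the actual OLMo code.
def save_final_checkpoint(trainer):
    if trainer.global_step % trainer.cfg.save_interval_unsharded != 0:
        # The unsharded checkpoint for this step has not been written yet,
        # so the final checkpoint is saved unsharded. This is the path
        # DDP needs to take.
        trainer.save_unsharded_checkpoint()
    else:
        # The unsharded checkpoint was already written at this step, so the
        # code falls through to the sharded saver. Under DDP
        # (distributed_strategy != FSDP) this reaches
        # TorchLegacyShardedCheckpointer, whose assertion fails and crashes
        # the run with the error shown above.
        trainer.save_sharded_checkpoint()
```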

Versions

NA