allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo

DDP training tries to save sharded checkpoint on the last step #664

Closed ananyahjha93 closed 1 month ago

ananyahjha93 commented 1 month ago

🐛 Describe the bug

2024-07-17T04:30:51.202069810Z 2024-07-16 21:30:51.201  jupiter-cs-aus-121.reviz.ai2.in:0   olmo.train:1268 INFO    Saving final checkpoint...
2024-07-17T04:30:52.220928528Z 2024-07-16 21:30:52.219  jupiter-cs-aus-121.reviz.ai2.in:5   olmo.util:163   CRITICAL    Uncaught AssertionError: TorchLegacyShardedCheckpointer is being called to save a model where `distributed_strategy` is not FSDP.

With DDP, when the last step count is divisible by save_interval_unsharded, the unsharded checkpoint has already been saved at that step. The final-checkpoint logic therefore falls through to the sharded checkpoint saver, TorchLegacyShardedCheckpointer, because of the following if condition: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L1256. Since distributed_strategy is DDP rather than FSDP, the sharded checkpointer hits the assertion above (see the sketch below).
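
A minimal sketch of that fall-through, assuming simplified names (save_unsharded_checkpoint, save_sharded_checkpoint, and the cfg/trainer fields are illustrative, not the exact olmo/train.py code):

```python
# Illustrative sketch of the final-checkpoint selection around
# olmo/train.py#L1256; names are hypothetical, not the actual OLMo code.
def save_final_checkpoint(trainer):
    if trainer.global_step % trainer.cfg.save_interval_unsharded != 0:
        # The unsharded checkpoint for this step has not been written yet,
        # so the final checkpoint is saved unsharded. This is the path
        # DDP needs to take.
        trainer.save_unsharded_checkpoint()
    else:
        # The unsharded checkpoint was already written at this step, so the
        # code falls through to the sharded saver. Under DDP
        # (distributed_strategy != FSDP) this reaches
        # TorchLegacyShardedCheckpointer, whose assertion fails and crashes
        # the run with the error shown above.
        trainer.save_sharded_checkpoint()
```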

Versions

NA