2024-07-17T04:30:51.202069810Z 2024-07-16 21:30:51.201 jupiter-cs-aus-121.reviz.ai2.in:0 olmo.train:1268 INFO Saving final checkpoint...
2024-07-17T04:30:52.220928528Z 2024-07-16 21:30:52.219 jupiter-cs-aus-121.reviz.ai2.in:5 olmo.util:163 CRITICAL Uncaught AssertionError: TorchLegacyShardedCheckpointer is being called to save a model where `distributed_strategy` is not FSDP.
🐛 Describe the bug
With DDP, when the last step count is divisible by `save_interval_unsharded`, the checkpoint for that step has already been saved, so the condition for saving the final checkpoint falls through to the sharded checkpoint saver `TorchLegacyShardedCheckpointer` because of the following if condition: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L1256

Versions
NA
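For clarity, the failing branch can be sketched as below. This is a hypothetical simplification, not the actual `olmo/train.py` code: the function name `choose_final_saver` and its parameters are invented for illustration, while `save_interval_unsharded`, `distributed_strategy`, and the saver class names mirror the report.

```python
# Illustrative sketch of the final-checkpoint selection around
# https://github.com/allenai/OLMo/blob/main/olmo/train.py#L1256
# (simplified; names of the function and arguments are hypothetical).

def choose_final_saver(global_step: int,
                       save_interval_unsharded: int,
                       distributed_strategy: str) -> str:
    # An unsharded checkpoint was already written at this step if the
    # final step count divides evenly by the unsharded save interval.
    already_saved_unsharded = global_step % save_interval_unsharded == 0

    if not already_saved_unsharded:
        # Normal path: write an unsharded final checkpoint.
        return "FullCheckpointer"

    # Bug path: falls through to the sharded saver even under DDP,
    # where TorchLegacyShardedCheckpointer asserts that
    # distributed_strategy is FSDP and raises the AssertionError above.
    return "TorchLegacyShardedCheckpointer"


# With DDP and a final step divisible by the interval, the sharded
# saver is selected and the assertion fires.
print(choose_final_saver(1000, 100, "ddp"))
```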