allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0
4.2k stars, 392 forks

adding DDP to the codebase #612

Closed: ananyahjha93 closed this 3 weeks ago

ananyahjha93 commented 3 weeks ago

Need to add a test to check DDP works well with checkpointing.
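
For illustration only, here is a minimal sketch of what such a round-trip check could look like. It is not the test added in this PR and does not use OLMo's trainer or checkpointer: the function name `ddp_checkpoint_roundtrip`, the tiny `nn.Linear` model, and the single-process `gloo` group on CPU are all assumptions made to keep the example self-contained. A real test would presumably launch several ranks and also cover optimizer state.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_checkpoint_roundtrip() -> bool:
    """Take one step under DDP, save, reload into a fresh model, compare weights.

    Hypothetical helper for illustration; not part of the OLMo codebase.
    """
    tmpdir = tempfile.mkdtemp()
    # Assumption: a single-process "gloo" group on CPU stands in for a real
    # multi-rank launch. A file-based store avoids picking a free port.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{os.path.join(tmpdir, 'store')}",
        rank=0,
        world_size=1,
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(4, 2))
        optim = torch.optim.SGD(model.parameters(), lr=0.1)

        # One optimizer step so the saved weights differ from initialization.
        model(torch.randn(8, 4)).sum().backward()
        optim.step()

        # Save the unwrapped module so the keys carry no "module." prefix.
        ckpt_path = os.path.join(tmpdir, "model.pt")
        torch.save(model.module.state_dict(), ckpt_path)

        # Load into a freshly initialized model, then wrap it in DDP again.
        restored = nn.Linear(4, 2)
        restored.load_state_dict(torch.load(ckpt_path))
        restored = DDP(restored)

        # Every tensor in the restored model must match the original exactly.
        return all(
            torch.equal(a, b)
            for a, b in zip(
                model.state_dict().values(), restored.state_dict().values()
            )
        )
    finally:
        dist.destroy_process_group()
```

Saving `model.module.state_dict()` rather than `model.state_dict()` is a design choice in this sketch: it keeps the checkpoint loadable by an unwrapped model, since DDP prefixes parameter names with `module.`.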

ananyahjha93 commented 3 weeks ago

@epwalsh done!

ananyahjha93 commented 3 weeks ago

At the 300M size, the DDP run follows the FSDP model's loss curve: DDP is better initially, but both curves show the same trend after some time.

[Image: loss curves for the DDP and FSDP runs]