allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

[wip] comparing vanilla torch model and clipping with OLMo FSDP no_shard and OLMo clipping #576

Closed ananyahjha93 closed 4 months ago

ananyahjha93 commented 4 months ago


The losses in the two setups differ by an order of magnitude. @epwalsh, can you sanity-check the OLMo grad_clipping code for FSDP no_shard?
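For reference, here is a minimal pure-Python sketch of the global-norm clipping rule that a check like this could compare against. It is not OLMo's implementation; the function names and the sample gradients are hypothetical. The scaling rule (multiply by `max_norm / (total_norm + eps)`, capped at 1) is intended to mirror what `torch.nn.utils.clip_grad_norm_` does.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every tensor as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by min(1, max_norm / (total_norm + eps)).

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = global_grad_norm(grads)
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * clip_coef for g in tensor] for tensor in grads]
    return clipped, total_norm

# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = global_grad_norm(clipped)
```

With no_shard, each rank should hold the full gradients after the all-reduce, so the norm computed on one rank would be expected to match a reference like this. Logging the pre-clip norm in both setups may help narrow down whether the gap comes from the clipping path or from somewhere else.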