allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

Open · ananyahjha93 opened 2 months ago

ananyahjha93 commented 2 months ago

[screenshot: loss curves for the vanilla torch run vs. the OLMo FSDP no_shard run]

There is an order-of-magnitude difference between the losses of the two setups. @dirkgr @epwalsh can you sanity-check the OLMo grad_clipping code for FSDP no_shard/DDP?

When the same 3 batches are sent to the model over and over, the model should overfit and the loss should go to 0. The screenshot shows this happening in the vanilla PyTorch run. (A minimal sketch of this sanity check follows the list below.)

Comparing the two runs:

  1. OLMo model used in both
  2. no FSDP vs FSDP-no_shard (32 bit)
  3. torch optim and clipping vs OLMo optim and clipping
  4. OLMo scheduler used in both
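
For reference, a minimal sketch of the overfit sanity check described above, assuming a causal LM whose forward pass returns an object with a `.logits` field; the function and argument names are illustrative, not code from this PR:

```python
import torch
import torch.nn.functional as F

def overfit_check(model, batches, steps=300, lr=1e-4, max_norm=1.0):
    """Cycle the same few batches; loss should approach 0 if the
    optimizer and clipping are behaving."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        batch = batches[step % len(batches)]  # repeat the same 3 batches
        logits = model(batch["input_ids"]).logits  # assumes HF-style output
        # next-token cross-entropy: shift logits/labels by one position
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )
        optim.zero_grad()
        loss.backward()
        # vanilla torch clipping, the reference setup in this comparison
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optim.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```
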
dirkgr commented 2 months ago

Can you make this a draft PR? I don't think we'll merge this.

ananyahjha93 commented 2 months ago

[screenshot: loss curves after switching to the torch optimizer and clipping]

Using the torch optimizer and torch gradient clipping with no_shard seems to fix it!
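
For concreteness, a sketch of that working combination: the model wrapped in FSDP with `ShardingStrategy.NO_SHARD`, driven by torch's own optimizer and clipping. It assumes the process group is already initialized; `olmo_model`, `compute_loss`, and `batch` are illustrative stand-ins, not code from this PR:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# assumes torch.distributed.init_process_group(...) has already run
model = FSDP(
    olmo_model,  # stand-in for an already-constructed OLMo module
    sharding_strategy=ShardingStrategy.NO_SHARD,  # DDP-equivalent, 32-bit
)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = compute_loss(model, batch)  # stand-in forward + loss helper
optim.zero_grad()
loss.backward()
# Under NO_SHARD every rank holds the full, already-reduced gradients,
# so plain torch clipping applies here; the sharded strategies would
# instead need FSDP's own model.clip_grad_norm_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optim.step()
```
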