Open ananyahjha93 opened 2 months ago
Can you make this a draft PR? I don't think we'll merge this.
torch optim and clipping vs OLMo optim and clipping
Find out which of these makes the difference.
Switching to the torch optimizer and gradient clipping with no_shard seems to fix it!
There is an order-of-magnitude difference between the losses of the two setups. @dirkgr @epwalsh can you sanity-check the OLMo grad-clipping code for FSDP no_shard/DDP?
When the same 3 batches are fed to the model repeatedly, it should overfit and the loss should go to 0. We can see that happening in the screenshot of the vanilla PyTorch run.
Comparing the two runs: