Open ananyahjha93 opened 2 months ago
Can you make this a draft PR? I don't think we'll merge this.
torch optim and clipping vs OLMo optim and clipping
Find out which of these makes the difference.
Switching to the torch optimizer and gradient clipping with no_shard seems to fix it!
There is an order-of-magnitude difference between the losses of the two setups. @dirkgr @epwalsh can you sanity-check the OLMo grad-clipping code for FSDP no_shard/DDP?
When the same 3 batches are fed to the model repeatedly, it should overfit and the loss should go to 0. We can see that happening in the screenshot of the vanilla PyTorch run.
Comparing the two runs: