Open · DrJimFan opened this issue 2 years ago
I saw something similar, fwiw -- the gradient scaler reported exploding gradients from the very first forward pass. I've read in other threads that this is fairly common in transformer architectures, especially ones with parameters or gradients smaller than the smallest normal 16-bit float (about 6.1e-5), which is apparently not uncommon.
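To make the threshold concrete, here is a small sketch showing where that 6.1e-5 figure comes from and how values below float16's representable range silently flush to zero (which is what makes FP16 gradients vanish rather than just lose precision):

```python
import torch

# Smallest *normal* float16 value -- the ~6.1e-5 threshold mentioned above.
tiny = torch.finfo(torch.float16).tiny
print(tiny)  # 6.103515625e-05

# Values below float16's subnormal range flush to zero on the cast,
# so sufficiently small gradients simply vanish under FP16 training.
g = torch.tensor([1e-3, 6.1e-5, 1e-8], dtype=torch.float32)
print(g.to(torch.float16))  # the 1e-8 entry becomes 0.0
```

This underflow is exactly why AMP multiplies the loss by a large scale factor before `backward()`: it shifts small gradients back into float16's representable range.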
Hi, I borrowed some snippets from your codebase for the distributed GPU and minibatch-within-batch training in my own project. However, I found that training with `manual_backward()` + FP16 does not converge at all. If I switch to FP32, training works without any other code changes. I'm using the latest pytorch-lightning v1.6.3. I wonder if you have observed similar issues?