Open · DrJimFan opened this issue 2 years ago
I saw something similar, fwiw -- the gradient scaler reported exploding gradients from the very first forward pass. I've read in other threads that this is fairly common in transformer architectures, especially ones with parameters or gradients smaller than the smallest normal 16-bit float (about 6.1e-5), which is apparently not uncommon.
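To make the threshold concrete, here is a small sketch showing where that 6.1e-5 figure comes from and how values below float16's representable range silently flush to zero (which is what makes FP16 gradients vanish rather than just lose precision):

```python
import torch

# Smallest *normal* float16 value -- the ~6.1e-5 threshold mentioned above.
tiny = torch.finfo(torch.float16).tiny
print(tiny)  # 6.103515625e-05

# Values below float16's subnormal range flush to zero on the cast,
# so sufficiently small gradients simply vanish under FP16 training.
g = torch.tensor([1e-3, 6.1e-5, 1e-8], dtype=torch.float32)
print(g.to(torch.float16))  # the 1e-8 entry becomes 0.0
```

This underflow is exactly why AMP multiplies the loss by a large scale factor before `backward()`: it shifts small gradients back into float16's representable range.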
Hi, I borrowed some snippets from your codebase for the distributed GPU and minibatch-within-batch training in my own project. However, I found that training with `manual_backward()` + FP16 does not converge at all. If I switch to FP32, training works without any other code changes. I'm using the latest pytorch-lightning v1.6.3. I wonder if you have observed similar issues?