Open blefaudeux opened 3 years ago
Love this idea! I think it will bring more stability to various models in AMP :)
Just an idea which popped up while discussing with a researcher: ideally we would have an optimizer which fuses the scaling step (same as our fused Adam/CUDA kernel). In that case, a scale per param group could keep the speed (for pointwise optimizers we could still process all the tensors in one go) while opening up more degrees of freedom for fp16 to fit. A rough sketch is below.
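Purely illustrative, nothing below exists in torch or fairscale: an SGD variant carrying one scale per param group and unscaling each group right before its pointwise update. It assumes the backward pass scaled each group's gradients by the matching factor; a real fused kernel would merge the unscale and the update into a single pass over the tensors.

```python
import torch


class PerGroupScaledSGD(torch.optim.SGD):
    """Hypothetical sketch: SGD that unscales each param group with its own factor."""

    def __init__(self, params, lr, group_scales=None, **kwargs):
        super().__init__(params, lr=lr, **kwargs)
        # one scaling factor per param group, defaulting to GradScaler's 2**16
        self.group_scales = group_scales or [2.0 ** 16] * len(self.param_groups)

    @torch.no_grad()
    def step(self, closure=None):
        for group, scale in zip(self.param_groups, self.group_scales):
            inv_scale = 1.0 / scale
            for p in group["params"]:
                if p.grad is not None:
                    p.grad.mul_(inv_scale)  # per-group unscale, then the usual update
        return super().step(closure)
```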
🚀 Feature
The GradScaler shipped with Torch AMP maintains a single dynamic scaling factor applied to the loss so that fp16 gradients neither underflow nor overflow, as explained here. The standard usage pattern is shown below.
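For reference, the current single-scale pattern (the toy model and random batch are only there to keep the snippet self-contained):

```python
import torch

model = torch.nn.Linear(64, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 64, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # one scale multiplies the loss, hence all grads
    scaler.step(optimizer)         # unscales, skips the whole step on inf/nan
    scaler.update()                # the single global scale grows or backs off
```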
Implement a multi-scale grad scaler, with either a scale per param or per param group, while sticking as close as possible to the current GradScaler API.
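A minimal sketch of what this could look like, assuming the API mirrors the existing scale/step/update calls. `MultiScaleGradScaler` and everything inside it are hypothetical: the loss still carries a single factor, but inf/nan detection, step skipping and scale growth/backoff are tracked per param group, so one misbehaving group no longer stalls the whole model.

```python
import torch


class MultiScaleGradScaler:
    """Hypothetical per-param-group scaler mirroring torch.cuda.amp.GradScaler."""

    def __init__(self, optimizer, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scales = [init_scale] * len(optimizer.param_groups)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = [0] * len(optimizer.param_groups)

    def scale(self, loss):
        # the loss can only carry one factor, so use the largest group scale;
        # a finer-grained variant could rescale per group with backward hooks
        self._loss_scale = max(self.scales)
        return loss * self._loss_scale

    def step(self, optimizer):
        self._found_inf = [False] * len(optimizer.param_groups)
        inv = 1.0 / self._loss_scale
        for i, group in enumerate(optimizer.param_groups):
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.grad.mul_(inv)  # unscale
                if not torch.isfinite(p.grad).all():
                    self._found_inf[i] = True
            if self._found_inf[i]:
                # drop this group's grads so only this group skips the update
                for p in group["params"]:
                    p.grad = None
        optimizer.step()

    def update(self):
        # each group's scale backs off or grows independently
        for i, bad in enumerate(self._found_inf):
            if bad:
                self.scales[i] = max(self.scales[i] * self.backoff_factor, 1.0)
                self._good_steps[i] = 0
            else:
                self._good_steps[i] += 1
                if self._good_steps[i] % self.growth_interval == 0:
                    self.scales[i] *= self.growth_factor
```

Usage would stay identical to the snippet above: `scaler.scale(loss).backward()`, `scaler.step(optimizer)`, `scaler.update()`.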
Motivation
One limitation of the current GradScaler is that there is a single scaling factor for the whole model, so it only works if one "window" fits all the gradients across the model. Empirically, this seems to be a limitation for some very deep or hard-to-initialize models, and users fall back to not using AMP in that case (and lose the Tensor Core benefits on a V100 for instance).
Pitch
Enable Torch AMP for everyone, with a one-stop shop.
Alternatives
Not doing this, which means that many users cannot use Torch AMP and fall back to fp32.
Additional context
Discussed internally at FB, got good feedback from a couple of users.