Open brando90 opened 3 years ago
This paper shows that "the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule." So we can use Adam with a warmup schedule wherever we would otherwise need RAdam.
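For reference, the "four steps of momentum SGD" in that quote can be checked with a minimal sketch. It assumes the rectification term as given in the RAdam paper, with the paper's condition that the update is rectified only when ρ_t > 4 and the common default β2 = 0.999. The function name and the `threshold` parameter are illustrative only, and some implementations use a different cutoff.

```python
import math

def radam_rectification(step, beta2=0.999, threshold=4.0):
    """Return RAdam's rectification factor r_t for a 1-indexed step,
    or None when the step falls back to an unrectified momentum update.
    Assumes the formulas from the RAdam paper; threshold=4 is the paper's condition."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= threshold:
        return None  # variance not tractable: no adaptive learning rate is used
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)

# Steps among the first 20 that are NOT rectified.
unrectified_steps = [t for t in range(1, 21) if radam_rectification(t) is None]
print(unrectified_steps)  # [1, 2, 3, 4]
```

From step 5 onward the factor grows from a small value toward 1, which is what makes it behave like a warmup schedule.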
My implementation: https://github.com/Tony-Y/pytorch_warmup
Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam.
I think the very basic fact both papers agree on is that it's necessary to include warmup to handle the variance of the adaptive learning rate. Their linear warmup schedule of 2/(1−β2) steps is just a further approximation to our derived first-order approximation.
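To make the 2/(1−β2) figure concrete, here is a minimal sketch of a linear warmup over that many steps. It assumes the factor multiplies the base learning rate and that steps are counted from 1; the function name is hypothetical and is not taken from either paper's code.

```python
def untuned_linear_warmup(step, beta2=0.999):
    """Warmup factor in (0, 1] for a 1-indexed step.
    The warmup period is 2 / (1 - beta2) steps, about 2000 for beta2 = 0.999."""
    warmup_period = 2.0 / (1.0 - beta2)
    return min(1.0, step / warmup_period)

# The factor rises linearly and is capped at 1 once the warmup period is over.
for step in (1, 500, 1000, 2000, 5000):
    print(step, untuned_linear_warmup(step))
```

In PyTorch, a function like this could plausibly be passed to `torch.optim.lr_scheduler.LambdaLR` as `lambda epoch: untuned_linear_warmup(epoch + 1)`, since that scheduler counts from 0 and a factor of exactly 0 on the first step is usually not wanted.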
Thanks for the question @brando90 and thanks for pointing me to the PR. Seeing that PR is a real encouragement to me.
From my perspective, being included as an official module in PyTorch doesn't matter that much. The motivation of our study is to show that the adaptive learning rate may cause problems (the strongest evidence is the controlled experiments, i.e., Adam-2k vs. Adam without warmup). RAdam serves to further verify our intuition on this matter. Although I'm very happy to see that our optimizer has helped and inspired many researchers, it is still experimental. It takes a lot of efforts to take the optimizer really to the next level.
We have been working on something new for the past two years, so stay tuned :-)
Hi Liyuan,
Great to hear from you!
What do you mean by "It takes a lot of efforts to take the optimizer really to the next level."? There aren't many hyperparameters to tune, so I am curious what that involves.
Looking forward to your next optimizer!
@Tony-Y I am also curious to know why you prefer warmup over RAdam, especially since RAdam seems quite robust and removes hyperparameters (which are the ML researcher's nightmare!)
I think that the new approach introduced by RAdam amounts only to a nonlinear warmup. Such nonlinear warmups may sometimes outperform the untuned linear warmup.
@Tony-Y the original paper you cited, "On the Adequacy of Untuned Warmup for Adaptive Optimization", claims that RAdam is just equivalent to Adam + warmup. From that perspective, it makes no difference which one of the two I use. Isn't it simpler to just fork the RAdam repo, git clone it, and then use RAdam? RAdam is just a standard PyTorch optimizer, so using it is trivial.
(My guess is) the other alternative is to use the Hugging Face warmup (which I've never used) https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup and then use the linear schedule suggested by the paper you linked.
In the end, given the claim that they are "equivalent", either algorithm is fine. I will go with RAdam for now since it's already downloaded in my code and it's just as simple to use as the other, unless of course you have code that makes warmup trivial to plug in or a convincing case beyond their being equivalent.
If you think warmup is better, perhaps a tutorial on how to use your warmup version would be great, to make it just as simple to plug in as RAdam. :)
I am looking forward to seeing how this debate on optimizers for transformers progresses.
I am curious, why hasn't RAdam been officially included in PyTorch?
https://github.com/pytorch/pytorch/issues/24892