This hyper-parameter doesn't really make sense, which is why I removed it. The 4 is there because the quantities used afterward aren't defined for r <= 4; beyond that point you're just ignoring steps with a very low lr (since v is very close to 0), which (1) can't hurt and (2) is the whole point of using RAdam: having a warmup.
The effective learning rate is plotted in the notebook just after the RAdam optimizer, where we can see this only impacts a few iterations at the very beginning of training. Going from 4 to 5, for instance, at a beta2 of 0.99 changes the behavior of the optimizer for a single iteration, so I have trouble believing this impacts training in any way.
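A quick sanity check of that claim (a minimal sketch, not the library code): compute the RAdam SMA term rho_t for beta2 = 0.99 and see at which step it first crosses each threshold.

```python
# Sketch only: evaluate rho_t = rho_inf - 2*t*beta2^t / (1 - beta2^t)
# and find the first step where it exceeds a given threshold.
beta2 = 0.99
rho_inf = 2 / (1 - beta2) - 1  # = 199 for beta2 = 0.99

def rho(t):
    return rho_inf - 2 * t * beta2**t / (1 - beta2**t)

for thresh in (4, 5):
    first = next(t for t in range(1, 100) if rho(t) > thresh)
    print(f"threshold {thresh}: rectified update starts at step {first}")
# threshold 4: rectified update starts at step 5
# threshold 5: rectified update starts at step 6
# -> moving the threshold from 4 to 5 only changes step 5.
```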
Closing this PR; please reopen with empirical evidence that it can be useful, as it doesn't make sense theoretically (and is not mentioned anywhere in the paper).
As we can see in the original RAdam implementation here, it may be useful to modify the maximum RAdam threshold depending on your problem.
I simply removed the previously hardcoded value (4) and added it as a parameter called `var_thresh`. A rough sketch of the idea is shown below.
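This is only an illustration of where the parameter would go, not the actual code in this PR: a simplified single-tensor RAdam step following the paper's update rule, where `var_thresh` replaces the hardcoded 4 in the comparison that decides whether the rectified, adaptive update is used.

```python
import torch

def radam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.99,
               eps=1e-8, var_thresh=4):
    # Simplified RAdam step for one parameter tensor (illustration only).
    # `var_thresh` is the proposed name for the previously hardcoded 4.
    state.setdefault('step', 0)
    state.setdefault('exp_avg', torch.zeros_like(p))
    state.setdefault('exp_avg_sq', torch.zeros_like(p))
    state['step'] += 1
    t = state['step']

    m, v = state['exp_avg'], state['exp_avg_sq']
    m.mul_(beta1).add_(grad, alpha=1 - beta1)          # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) # second moment

    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
    m_hat = m / (1 - beta1**t)

    if rho_t > var_thresh:
        # Rectified, adaptive update (the paper's formula keeps the
        # constants 4 and 2 inside the rectification term itself).
        r = ((rho_t - 4) * (rho_t - 2) * rho_inf /
             ((rho_inf - 4) * (rho_inf - 2) * rho_t)) ** 0.5
        v_hat = (v / (1 - beta2**t)).sqrt().add_(eps)
        p.add_(m_hat / v_hat, alpha=-lr * r)
    else:
        # Un-adapted (momentum-only) update during warmup.
        p.add_(m_hat, alpha=-lr)
```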