This hyper-parameter doesn't really make sense, which is why I removed it. The 4 is there because the quantities used afterward aren't defined for r <= 4; beyond that point you're just ignoring steps with a very low lr (since v is very close to 0), which (1) can't hurt and (2) is the whole point of using RAdam: having a warmup.
The effective learning rate is plotted in the notebook just after the RAdam optimizer, where we can see this only impacts a few iterations at the very beginning of training. Going from 4 to 5, for instance, at a beta2 of 0.99 changes the behavior of the optimizer for a single iteration, so I have trouble believing this impacts training in any way.
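A quick sanity check of that claim (a minimal sketch, not the library code): compute the RAdam SMA term rho_t for beta2 = 0.99 and see at which step it first crosses each threshold.

```python
# Sketch only: evaluate rho_t = rho_inf - 2*t*beta2^t / (1 - beta2^t)
# and find the first step where it exceeds a given threshold.
beta2 = 0.99
rho_inf = 2 / (1 - beta2) - 1  # = 199 for beta2 = 0.99

def rho(t):
    return rho_inf - 2 * t * beta2**t / (1 - beta2**t)

for thresh in (4, 5):
    first = next(t for t in range(1, 100) if rho(t) > thresh)
    print(f"threshold {thresh}: rectified update starts at step {first}")
# threshold 4: rectified update starts at step 5
# threshold 5: rectified update starts at step 6
# -> moving the threshold from 4 to 5 only changes step 5.
```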
Closing this PR; please reopen with empirical evidence that it can be useful, as it doesn't make sense theoretically (and is not mentioned anywhere in the paper).
As we can see in the original RAdam implementation here, it may be useful to modify the maximum RAdam threshold depending on your problem.
I simply removed the previously hardcoded value (4) and added it as a parameter called `var_thresh`. A rough sketch of the idea is shown below.
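This is only an illustration of where the parameter would go, not the actual code in this PR: a simplified single-tensor RAdam step following the paper's update rule, where `var_thresh` replaces the hardcoded 4 in the comparison that decides whether the rectified, adaptive update is used.

```python
import torch

def radam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.99,
               eps=1e-8, var_thresh=4):
    # Simplified RAdam step for one parameter tensor (illustration only).
    # `var_thresh` is the proposed name for the previously hardcoded 4.
    state.setdefault('step', 0)
    state.setdefault('exp_avg', torch.zeros_like(p))
    state.setdefault('exp_avg_sq', torch.zeros_like(p))
    state['step'] += 1
    t = state['step']

    m, v = state['exp_avg'], state['exp_avg_sq']
    m.mul_(beta1).add_(grad, alpha=1 - beta1)          # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) # second moment

    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
    m_hat = m / (1 - beta1**t)

    if rho_t > var_thresh:
        # Rectified, adaptive update (the paper's formula keeps the
        # constants 4 and 2 inside the rectification term itself).
        r = ((rho_t - 4) * (rho_t - 2) * rho_inf /
             ((rho_inf - 4) * (rho_inf - 2) * rho_t)) ** 0.5
        v_hat = (v / (1 - beta2**t)).sqrt().add_(eps)
        p.add_(m_hat / v_hat, alpha=-lr * r)
    else:
        # Un-adapted (momentum-only) update during warmup.
        p.add_(m_hat, alpha=-lr)
```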