Closed kayuksel closed 5 years ago
LGTM now.
@kayuksel I will have a try with AdaBoundW. If everything is OK, I will update the README and release a new version to include it. But I am really busy with some other issues of AdaBound these days, so it might take some time.
BTW, I would like to add a contributors section and add your name. Should it be @kayuksel or @akaniklaus in #13? I don't know your relationship lol.
Any other questions? :D
@Luolc Ah, that's very nice of you. I've never been listed as a contributor :)
They're both my accounts; you can add this one, @kayuksel. Thnx!!!
Hi, sorry for the late chime in on this @kayuksel @Luolc
It seems that in AdaBoundW the weight decay is applied without first multiplying by the step size. This differs from what is done in the official PyTorch AdamW implementation.
See here: https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L73
This means that if we have a lr of 1e-3 and a weight_decay of 1e-3, in the official AdamW the effective weight decay will be 1e-6, whereas in AdaBoundW the effective weight decay will be 1e-3.
Any suggestions?
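To make the difference concrete, here is a minimal sketch of the two decay conventions on a single scalar parameter (illustrative numbers only; this is not the actual AdaBoundW or AdamW source, and the gradient step is omitted):

```python
# Decoupled weight decay shrinks the parameter directly; the question is
# whether that shrinkage is scaled by the learning rate.

lr = 1e-3
weight_decay = 1e-3
p = 1.0  # a single parameter value; the Adam gradient update is omitted

# Official PyTorch AdamW: the decay is multiplied by the step size,
# so the effective per-step decay is lr * weight_decay = 1e-6.
p_adamw = p * (1.0 - lr * weight_decay)

# AdaBoundW as described above: the decay is applied without the lr
# factor, so the effective per-step decay is weight_decay = 1e-3.
p_adaboundw = p * (1.0 - weight_decay)

print(p_adamw, p_adaboundw)
```

With these numbers, a single step leaves the AdamW-style parameter at 0.999999 but the un-scaled variant at 0.999, a thousandfold difference in shrinkage per step.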
An AdaBound version with decoupled weight decay has been implemented in the code as an additional class, as discussed in the recent issue #13.