Luolc / AdaBound

An optimizer that trains as fast as Adam and as good as SGD.
https://www.luolc.com/publications/adabound/
Apache License 2.0

AdaBoundW #14

Closed kayuksel closed 5 years ago

kayuksel commented 5 years ago

An AdaBound variant with decoupled weight decay, implemented in the code as an additional class, as discussed in the recent issue #13.
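
For context, here is a minimal sketch of the distinction between classic L2 weight decay and decoupled weight decay inside a single optimizer step. The helper names and the simplified update are illustrative assumptions, not the actual AdaBound/AdaBoundW code:

```python
import torch

def step_l2(p: torch.Tensor, grad: torch.Tensor, lr: float, weight_decay: float):
    # Classic L2 weight decay: the decay term is folded into the gradient,
    # so it passes through any adaptive moment estimates.
    grad = grad.add(p, alpha=weight_decay)
    p.data.add_(grad, alpha=-lr)  # stand-in for the full adaptive update

def step_decoupled(p: torch.Tensor, grad: torch.Tensor, lr: float, weight_decay: float):
    # Decoupled (AdamW-style) weight decay: shrink the weights directly,
    # separately from the gradient-based update.
    p.data.mul_(1 - lr * weight_decay)
    p.data.add_(grad, alpha=-lr)  # stand-in for the full adaptive update
```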

Luolc commented 5 years ago

LGTM now.

@kayuksel I will give AdaBoundW a try. If everything looks OK, I will update the README and release a new version that includes it. But I am really busy with some other AdaBound issues these days, so it might take some time.

BTW, I would like to add a contributors section and add your name. Should it be @kayuksel or @akaniklaus in #13? I don't know your relationship lol.

Any other questions? :D

kayuksel commented 5 years ago

@Luolc Ah, that's very nice of you. I've never been listed as a contributor :)

They're both my accounts; you can add this one, @kayuksel. Thanks!!!

Laksh1997 commented 4 years ago

Hi, sorry to chime in late on this, @kayuksel @Luolc.

It seems that in AdaBoundW the weight decay is applied without first being multiplied by the step size. This differs from the official PyTorch AdamW implementation.

See here: https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L73

This means that with an lr of 1e-3 and a weight_decay of 1e-3, the official AdamW applies an effective per-step weight decay of 1e-6, whereas AdaBoundW applies an effective weight decay of 1e-3.
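
To make the difference concrete, here is my reading of the two update rules as a runnable sketch; the lines are illustrative, not verbatim from either codebase:

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
lr, weight_decay = 1e-3, 1e-3

# Official PyTorch AdamW: the decay is scaled by the step size.
with torch.no_grad():
    p_adamw = p.clone().mul_(1 - lr * weight_decay)   # shrink factor 1 - 1e-6

# AdaBoundW as currently written (my reading): no lr scaling.
with torch.no_grad():
    p_adaboundw = p.clone().mul_(1 - weight_decay)    # shrink factor 1 - 1e-3

print(p_adamw[0].item(), p_adaboundw[0].item())  # ~0.999999 vs ~0.999
```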

Any suggestions?