Closed nestordemeure closed 2 years ago

Hello,

While translating your optimizer to Flax (here), I noticed that you are using traditional weight decay, where you add the weight decay to the gradient (here in your implementation), rather than an AdamW-style weight decay (which, I believe, is now the default for most optimizers), where you would subtract the weight decay times the learning rate just before returning the parameters.

Is there a particular reason for that decision?
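For concreteness, here is a minimal sketch of the two forms being contrasted (the function and variable names are illustrative, not MADGRAD's actual code):

```python
import torch

def traditional_decay(grad: torch.Tensor, param: torch.Tensor, wd: float) -> torch.Tensor:
    # Traditional (L2-regularization-style) weight decay: fold wd * param into
    # the gradient, so it flows through the optimizer's statistics and preconditioning.
    return grad.add(param, alpha=wd)

def adamw_style_decay(param: torch.Tensor, lr: float, wd: float) -> None:
    # AdamW-style decoupled decay: shrink the parameters directly, scaled by
    # the learning rate, independently of the gradient statistics.
    param.mul_(1 - lr * wd)  # equivalent to param -= lr * wd * param
```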
There is no particular reason why we currently use the traditional weight decay form. I haven't experimented with AdamW-style decay with MADGRAD yet. If you find it works on your problem, let me know!
I tried the following form on a personal dataset (regression on tabular data):
```python
updated_parameters -= (1. - beta) * power(learning_rate, 2./3.) * weight_decay * param
```
And got better results than what I had with the default weight decay method (but that might be due to parameter tuning, or it might not generalize to other datasets).
The multiplication by (1. - beta) * power(learning_rate, 2./3.), rather than learning_rate, is an effort to use the actual step size rather than the raw learning rate (lr / cbrt(lr) = lr^(2/3)). With that scaling, the weight decay I had previously tuned for Adam seemed to work best, which is practical.
Yes, that's a good idea in terms of weighting by the learning rate. I had considered that; however, if you use a changing learning rate over time it will result in odd behavior. E.g., if you decrease the learning rate 10x midway through training, as is common for ImageNet, the decay term won't decrease 10x in practice. But it works for scaling the initial learning rate.
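To make the scaling issue concrete: a 10x learning-rate drop only changes the lr^(2/3) factor by 10^(2/3) ≈ 4.64x.

```python
# A 10x learning-rate decrease (0.1 -> 0.01) scales lr**(2/3) by only ~4.64x:
print(0.1 ** (2 / 3) / 0.01 ** (2 / 3))  # 10**(2/3) ~= 4.6416, not 10
```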
A friend just tested both the default weight decay and an AdamW-style weight decay for image classification. He found that with the default weight decay he got no improvement (even with low values), whereas he got his best test score so far with the AdamW-style weight decay.
Overall it seems worth using.
I will look into adding the AdamW-style weight decay as an option, thanks for the discussion and results!
```python
updated_parameters -= (1. - beta) * power(learning_rate, 2./3.) * weight_decay * param
```
@nestordemeure Would you mind helping me out with trying to implement this in the PyTorch version here? I believe I've got everything in place, but what would beta be here? So far, I couldn't find an appropriate equivalent variable in the original MADGRAD implementation.
beta is momentum in this implementation (here). I called it beta in my own code to stay consistent with the usual naming scheme.
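Concretely, a sketch of how the proposal could be written inside the PyTorch step() (assuming the parameter group carries momentum, lr, and weight_decay as in MADGRAD's constructor; this is an illustration, not committed code):

```python
# Inside step(), per parameter p in a group (sketch):
momentum = group["momentum"]   # plays the role of `beta` above
lr = group["lr"]
decay = group["weight_decay"]

if decay != 0:
    # Decoupled decay scaled by the effective step size (1 - momentum) * lr**(2/3),
    # i.e. p -= (1 - momentum) * lr**(2/3) * decay * p.
    p.data.mul_(1 - (1 - momentum) * lr ** (2 / 3) * decay)
```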
Perfect, thanks for the quick reply!
I'm going to add AdamW-style weight decay to the implementation this week, as it seems popular based on the comments here.
Any updates on this? :)
I'm looking into this now; it's not actually clear what the correct way is to do decoupled weight decay within a dual averaging framework. I don't want to commit code until I'm sure it's correct.
I'm currently testing adding an update similar to:
```python
p.data.div_(lr ** (2 / 3) * (k + 1) * decay + 1)
```
after the p.data update, with line 119 removed. This is an explicit type of weight decay, slightly different from AdamW but better suited to the dual averaging framework. I need to run some experiments and make sure it works on the standard test problems before I commit the code. It needs to handle a changing learning rate during optimization, so I'm actually using an accumulating sum of learning rates.
It's in branch decoupleddecay if you want to try it out.
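For anyone reading along, a rough sketch of what that accumulated form could look like (the name lamb_sum is hypothetical; check the decoupleddecay branch for the real code):

```python
# Accumulate the per-step effective scale so the decay tracks lr schedules.
lamb = lr ** (2 / 3)
state["lamb_sum"] = state.get("lamb_sum", 0.0) + lamb

# Applied after the p.data update; with a constant lr this reduces to
# dividing by lr**(2/3) * (k + 1) * decay + 1, as in the snippet above.
p.data.div_(state["lamb_sum"] * decay + 1)
```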
I've switched back to the simplest form, essentially @russelldc's suggestion but without the lr^(2/3) power. The 2/3 power gives the correct scaling at the beginning of training; however, after later learning-rate decreases it will scale the decay the wrong way. It's best to adjust your decay beforehand using the 2/3 correction rather than have it in the code.
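A sketch of that simplest form, assuming the same names as the snippet above (see the repository for the committed version):

```python
# Plain decoupled decay scaled by the raw lr, applied to the parameters
# instead of being added to the gradient: p -= lr * decay * p.
if decay != 0:
    p.data.mul_(1 - lr * decay)
```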
This is perhaps a silly question, but why is decouple_decay not added to the defaults dict?
That would probably be a better way to do it, I'll make that change when I have the time this week.
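For reference, the standard torch.optim pattern would look something like this sketch (assuming MADGRAD's usual constructor arguments):

```python
defaults = dict(
    lr=lr,
    eps=eps,
    momentum=momentum,
    weight_decay=weight_decay,
    decouple_decay=decouple_decay,  # per-group and serialized with state dicts
)
super().__init__(params, defaults)
```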