IssamLaradji / sps

Official code for the Stochastic Polyak step-size optimizer

Train with weight decay and momentum #4

Open milliema opened 3 years ago

milliema commented 3 years ago

I'm using SLS to train my own model, but I found that training behaves differently with plain SGD than with SGD + weight decay + momentum. With plain SGD, the step size increases at first, following an exponential trend, which is consistent with your published work. However, with SGD + weight decay + momentum, the step size stays very stable (0.02~0.03) most of the time. Can you explain why? Is SPS incompatible with optimizer momentum and weight decay?

IssamLaradji commented 3 years ago

We have noticed the same behaviour with the step size when incorporating momentum. I am not sure why that is happening, but our team is investigating this phenomenon, because it is an interesting behavior.

milliema commented 3 years ago

Thanks for your reply.

> We have noticed the same behavior with the step size when incorporating momentum.

So the behavior is only related to momentum? Did you test with weight decay as well? My guess is that when momentum is adopted the weight norm increases, so the gradient norm may increase too, and the computed step size therefore decreases. But with weight decay + momentum the weight norm is normally stable, so I am confused by the results I get. BTW, have you ever tested SPS or SLS on larger datasets (e.g. ImageNet)? The idea seems very interesting and promising for diverse applications.
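To make the guess above concrete, here is a minimal sketch of the capped Polyak step size as I understand it from the paper: the loss gap divided by a constant times the squared gradient norm, clipped at an upper bound. The function name `sps_step_size`, the defaults `c=0.5`, `f_star=0.0`, `gamma_max=10.0`, and the small `eps` are my own assumptions for illustration, not the code in this repository.

```python
def sps_step_size(loss, grad_norm, c=0.5, f_star=0.0, gamma_max=10.0, eps=1e-8):
    """Capped Polyak step: min((loss - f_star) / (c * ||g||^2), gamma_max).

    All defaults are assumed values for illustration only.
    """
    # eps guards against division by zero when the gradient vanishes
    step = (loss - f_star) / (c * grad_norm ** 2 + eps)
    return min(step, gamma_max)


# Same loss, growing gradient norm: the computed step size shrinks
loss = 0.5
for grad_norm in (1.0, 2.0, 4.0):
    print(grad_norm, sps_step_size(loss, grad_norm))
```

At a fixed loss, doubling the gradient norm cuts the step size by roughly a factor of four. So if momentum really does inflate the gradient norm, a smaller step size would be expected. That still leaves open why the step size is also flat with weight decay + momentum.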