clovaai / AdamP

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights (ICLR 2021)
https://clovaai.github.io/AdamP/
MIT License

Could it be equivalent to normalize the weights? #13

Closed. axeldavy closed this issue 3 years ago.

axeldavy commented 3 years ago

Thanks to the authors for the very interesting paper and analysis.

I was wondering whether an equivalent fix for the weight growth could be to normalize the weights of the layers that precede normalization layers during training. For example, I would normalize the weights every 10 mini-batches, so that the operation remains cheap.
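For concreteness, here is a minimal PyTorch sketch of what I have in mind (not the paper's method; `renormalize_scale_invariant_weights` and `scale_invariant_layers` are hypothetical names, and identifying which layers are actually scale-invariant is left to the user):

```python
import torch

@torch.no_grad()
def renormalize_scale_invariant_weights(layers, target_norm=1.0):
    """Rescale the weights of layers that feed into a normalization layer.

    Because the following normalization layer makes the function invariant to
    the weight scale, this rescaling does not change the network output.
    """
    for layer in layers:
        norm = layer.weight.norm()
        if norm > 0:
            layer.weight.mul_(target_norm / norm)

# Hypothetical training-loop fragment: renormalize every 10 mini-batches so the
# extra cost stays negligible.
#
# for step, (x, y) in enumerate(loader):
#     optimizer.zero_grad()
#     loss = criterion(model(x), y)
#     loss.backward()
#     optimizer.step()
#     if (step + 1) % 10 == 0:
#         renormalize_scale_invariant_weights(scale_invariant_layers)
```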

SanghyukChun commented 3 years ago

Hi @axeldavy. That would not be equivalent to our results, but I think it could be empirically useful. In theory, your suggestion does not guarantee the convergence of the optimizer because

Therefore, in terms of theory, I cannot guarantee that your solution will converge (let alone at the optimal convergence rate). In practice, however, since perfect convergence is not a property the modern neural network community insists on, your solution could work. That said, in my opinion it could be sensitive to hyperparameters, namely the normalization interval, the initial learning rate, and the learning rate schedule.
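To make the hyperparameter point concrete, a hedged sketch (not from this thread) of the two setups; the AdamP constructor call follows the usage shown in the project README, and `renorm_every` is a hypothetical name for the extra interval hyperparameter the periodic trick introduces:

```python
import torch.nn as nn
from adamp import AdamP  # pip install adamp

# Toy model: a conv layer followed by BatchNorm, so the conv weights are scale-invariant.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Projection-based approach from this repository: the optimizer itself suppresses
# the norm growth of scale-invariant weights, so no extra schedule is introduced.
optimizer = AdamP(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)

# Periodic-normalization approach: the interval becomes an additional
# hyperparameter that interacts with the learning rate and its schedule.
renorm_every = 10  # hypothetical value; would need tuning per task
```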

axeldavy commented 3 years ago

Thank you for this very interesting answer!