# Slowing Down the Weight Norm Increase in Momentum-based Optimizers

Normalization techniques, such as batch normalization (BN), have led to
significant improvements in deep neural network performances. Prior studies
have analyzed the benefits of the resulting scale invariance of the weights for
the gradient descent (GD) optimizers: it leads to a stabilized training due to
the auto-tuning of step sizes. However, we show that, combined with the
momentum-based algorithms, the scale invariance tends to induce an excessive
growth of the weight norms. This in turn overly suppresses the effective step
sizes during training, potentially leading to sub-optimal performances in deep
neural networks. We analyze this phenomenon both theoretically and empirically.
We propose a simple and effective solution: at each iteration of momentum-based
GD optimizers (e.g. SGD or Adam) applied on scale-invariant weights (e.g. Conv
weights preceding a BN layer), we remove the radial component (i.e. parallel to
the weight vector) from the update vector. Intuitively, this operation prevents
the unnecessary update along the radial direction that only increases the
weight norm without contributing to the loss minimization. We verify that the
modified optimizers SGDP and AdamP successfully regularize the norm growth and
improve the performance of a broad set of models. Our experiments cover tasks
including image classification and retrieval, object detection, robustness
benchmarks, and audio classification. Source code is available at
https://github.com/clovaai/AdamP.

arXiv: https://arxiv.org/abs/2006.08217
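The core operation described in the abstract (dropping the radial, i.e. weight-parallel, component of the update for scale-invariant weights) can be sketched in a few lines of PyTorch. This is a simplified illustration, not the authors' implementation: the function name, the plain projection, and the absence of any check for whether a weight is actually scale-invariant are my own simplifications; the official code in the repository above handles those details.

```python
import torch


def remove_radial_component(weight: torch.Tensor, update: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """Return `update` with its component parallel to `weight` projected out.

    For scale-invariant weights (e.g. conv weights followed by BN), the radial
    component only inflates the weight norm without changing the loss, so it
    can be removed from the step.
    """
    w = weight.reshape(-1)
    u = update.reshape(-1)
    w_hat = w / (w.norm() + eps)          # unit vector along the weight direction
    u = u - torch.dot(w_hat, u) * w_hat   # subtract the radial (parallel) part
    return u.reshape_as(update)


# Illustrative use inside a momentum-SGD step (variable names are hypothetical):
# buf.mul_(momentum).add_(grad)                 # momentum buffer update
# step = remove_radial_component(weight, buf)   # drop the radial component
# weight.add_(step, alpha=-lr)                  # apply the projected step
```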
### link preview?

linkpreview: Get link (URL) preview
https://pypi.org/project/linkpreview/
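For reference, usage of the package looks roughly like the following, based on my reading of the project page; the `link_preview` function name and the `title`/`description`/`image` attributes are assumptions to verify against the linked docs.

```python
# Sketch of fetching a preview card for the paper's arXiv page with the
# `linkpreview` package; names are taken from the project page and should
# be double-checked there.
from linkpreview import link_preview

preview = link_preview("https://arxiv.org/abs/2006.08217")
print("title:", preview.title)
print("description:", preview.description)
print("image:", preview.image)
```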
cool article.