Closed: axeldavy closed this issue 3 years ago.
Hi @axeldavy,

That will not be equivalent to our results, but I think it could be empirically useful. In theory, your suggestion does not guarantee the convergence of the optimizer. In "Theoretical Analysis of Auto Rate-Tuning by Batch Normalization" [link], the norm increase of the scale-invariant parameters is exactly what lowers their effective learning rate: the norm monotonically increases, therefore the effective learning rate monotonically decreases by some factor (and our theory shows that this factor changes when momentum is applied). Arora et al. theoretically showed that, because of this mechanism, the SGD optimizer converges at the optimal convergence rate even if a high constant learning rate is applied to the scale-invariant parameters. Periodically renormalizing the weights resets their norm and therefore interferes with this mechanism.

So, in terms of theory, I cannot guarantee that your solution converges (with the optimal convergence rate). In practice, since perfect convergence is not a property the modern neural network community insists on, your solution could work. But, IMO, it could be sensitive to hyperparameters, namely the interval for normalization, the initial learning rate, and the learning rate schedule.
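To make the relation concrete, here is a minimal toy sketch (not code from our paper; the objective and names are made up) of how a constant learning rate on a scale-invariant parameter behaves like a shrinking effective learning rate lr / ||w||^2:

```python
import numpy as np

# Toy sketch: for a scale-invariant loss, i.e. L(c * w) == L(w) for any c > 0,
# the gradient is orthogonal to w and scales as 1 / ||w||.  A plain SGD step
# with a constant learning rate lr therefore acts on the unit sphere like a
# step with effective learning rate lr / ||w||^2, and since the orthogonal
# updates can only grow ||w||, that effective learning rate can only shrink.

rng = np.random.default_rng(0)
w = rng.normal(size=8)
lr = 1.0  # intentionally large constant learning rate

def scale_invariant_grad(w):
    # stand-in gradient of a scale-invariant objective: orthogonal to w,
    # with magnitude proportional to 1 / ||w||
    u = w / np.linalg.norm(w)
    g = rng.normal(size=w.shape)   # placeholder direction for dL/du
    g -= u * (g @ u)               # remove the radial component
    return g / np.linalg.norm(w)

for step in range(5):
    w = w - lr * scale_invariant_grad(w)   # plain SGD, constant lr
    norm = np.linalg.norm(w)
    print(f"step {step}: ||w|| = {norm:.2f}, effective lr = {lr / norm**2:.3f}")
```

Renormalizing w every few steps resets ||w|| to 1 and the effective learning rate back to lr, which is why the monotone-decrease argument above no longer applies directly.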
Thank you for this very interesting answer!
Thanks to the authors for the very interesting paper and analysis.

I was wondering if an equivalent fix for the weight growth could be to normalize the weights of the layers that precede normalization layers during training? For example, every 10 mini-batches I would normalize the weights, so that the operation remains cheap.
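Concretely, I am imagining something along these lines (a rough, hypothetical sketch, not tested; the helper name is made up and the layer matching assumes a simple sequential conv/linear + BatchNorm layout):

```python
import torch
import torch.nn as nn

def renormalize_scale_invariant_weights(model: nn.Module) -> None:
    """Rescale to unit norm the weights of layers feeding into a BatchNorm.

    Hypothetical helper: because BatchNorm makes the preceding layer
    scale-invariant, dividing its weight (and bias) by the same constant
    leaves the network function unchanged and only resets the weight norm.
    """
    modules = list(model.modules())
    with torch.no_grad():
        for layer, nxt in zip(modules, modules[1:]):
            # crude adjacency test: assumes modules are registered in forward order
            if isinstance(layer, (nn.Conv2d, nn.Linear)) and isinstance(
                nxt, (nn.BatchNorm1d, nn.BatchNorm2d)
            ):
                norm = layer.weight.norm()
                if norm > 0:
                    layer.weight.div_(norm)
                    if layer.bias is not None:
                        layer.bias.div_(norm)

# usage inside a training loop, every 10 mini-batches:
# for step, (x, y) in enumerate(loader):
#     optimizer.zero_grad()
#     loss = criterion(model(x), y)
#     loss.backward()
#     optimizer.step()
#     if step % 10 == 0:
#         renormalize_scale_invariant_weights(model)
```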