Closed: axeldavy closed this issue 3 years ago.
Hi @axeldavy,

That will not be equivalent to our results, but I think it could be empirically useful. In theory, your suggestion does not guarantee the convergence of the optimizer. In "Theoretical Analysis of Auto Rate-Tuning by Batch Normalization" [link], the norm increase of the scale-invariant parameters is exactly what lowers their effective learning rate: the norm monotonically increases, therefore the effective learning rate monotonically decreases by some factor (and our theory shows that this factor changes when momentum is applied). Arora et al. theoretically showed that, because of this mechanism, the SGD optimizer converges at the optimal convergence rate even if a high constant learning rate is applied to the scale-invariant parameters. Periodically renormalizing the weights resets their norm and therefore interferes with this mechanism.

So, in terms of theory, I cannot guarantee that your solution converges (with the optimal convergence rate). In practice, since perfect convergence is not a property the modern neural network community insists on, your solution could work. But, IMO, it could be sensitive to hyperparameters, namely the interval for normalization, the initial learning rate, and the learning rate schedule.
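To make the relation concrete, here is a minimal toy sketch (not code from our paper; the objective and names are made up) of how a constant learning rate on a scale-invariant parameter behaves like a shrinking effective learning rate lr / ||w||^2:

```python
import numpy as np

# Toy sketch: for a scale-invariant loss, i.e. L(c * w) == L(w) for any c > 0,
# the gradient is orthogonal to w and scales as 1 / ||w||.  A plain SGD step
# with a constant learning rate lr therefore acts on the unit sphere like a
# step with effective learning rate lr / ||w||^2, and since the orthogonal
# updates can only grow ||w||, that effective learning rate can only shrink.

rng = np.random.default_rng(0)
w = rng.normal(size=8)
lr = 1.0  # intentionally large constant learning rate

def scale_invariant_grad(w):
    # stand-in gradient of a scale-invariant objective: orthogonal to w,
    # with magnitude proportional to 1 / ||w||
    u = w / np.linalg.norm(w)
    g = rng.normal(size=w.shape)   # placeholder direction for dL/du
    g -= u * (g @ u)               # remove the radial component
    return g / np.linalg.norm(w)

for step in range(5):
    w = w - lr * scale_invariant_grad(w)   # plain SGD, constant lr
    norm = np.linalg.norm(w)
    print(f"step {step}: ||w|| = {norm:.2f}, effective lr = {lr / norm**2:.3f}")
```

Renormalizing w every few steps resets ||w|| to 1 and the effective learning rate back to lr, which is why the monotone-decrease argument above no longer applies directly.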
Thank you for this very interesting answer!
Thanks to the authors for the very interesting paper and analysis.

I was wondering if an equivalent fix for the weight growth could be to normalize the weights of the layers that precede normalization layers during training? For example, every 10 mini-batches I would normalize the weights, so that the operation remains cheap.
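Concretely, I am imagining something along these lines (a rough, hypothetical sketch, not tested; the helper name is made up and the layer matching assumes a simple sequential conv/linear + BatchNorm layout):

```python
import torch
import torch.nn as nn

def renormalize_scale_invariant_weights(model: nn.Module) -> None:
    """Rescale to unit norm the weights of layers feeding into a BatchNorm.

    Hypothetical helper: because BatchNorm makes the preceding layer
    scale-invariant, dividing its weight (and bias) by the same constant
    leaves the network function unchanged and only resets the weight norm.
    """
    modules = list(model.modules())
    with torch.no_grad():
        for layer, nxt in zip(modules, modules[1:]):
            # crude adjacency test: assumes modules are registered in forward order
            if isinstance(layer, (nn.Conv2d, nn.Linear)) and isinstance(
                nxt, (nn.BatchNorm1d, nn.BatchNorm2d)
            ):
                norm = layer.weight.norm()
                if norm > 0:
                    layer.weight.div_(norm)
                    if layer.bias is not None:
                        layer.bias.div_(norm)

# usage inside a training loop, every 10 mini-batches:
# for step, (x, y) in enumerate(loader):
#     optimizer.zero_grad()
#     loss = criterion(model(x), y)
#     loss.backward()
#     optimizer.step()
#     if step % 10 == 0:
#         renormalize_scale_invariant_weights(model)
```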