Open · akaniklaus opened this issue 5 years ago
I would also investigate how the method can help enable "Super Convergence": https://arxiv.org/abs/1708.07120
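For comparison, the standard 1cycle recipe from that paper is easy to wire up in PyTorch. This is only a minimal sketch of the baseline schedule (placeholder model, dummy loss, hand-picked `max_lr`), not the method proposed here:

```python
import torch

model = torch.nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

epochs, steps_per_epoch = 10, 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1.0, total_steps=epochs * steps_per_epoch)

for step in range(epochs * steps_per_epoch):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # one scheduler step per batch
```

The interesting question is whether the online learning-rate adaptation can discover a similarly aggressive rise-and-fall profile on its own.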
Partially Adaptive Momentum Estimation (Padam): given that there is research supporting the idea of switching from Adam to SGD in later epochs for better generalization, I have implemented this by starting the partial parameter at its fully adaptive (Adam-like) value and decaying it to a hypertuned lower value (between 0.0 and 1.0). I am curious whether the method proposed here could also provide a better, more dynamic way of achieving this; a sketch of the schedule I use follows below.
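To make that concrete, this is roughly the kind of schedule I mean for the exponent. It is only a sketch: the function name, the linear form, and the end value 0.25 are arbitrary choices of mine.

```python
def partial_schedule(epoch, total_epochs, start=1.0, end=0.25):
    """Anneal the partially adaptive exponent from a fully adaptive,
    Adam-like setting (1.0) towards a more SGD-like one as training goes on."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * t

# The value is then used as the exponent on the Adam denominator in the
# parameter update, i.e. denom ** partial (see the update line quoted further below).
```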
A related approach is AdaBound (Adaptive Gradient Methods with Dynamic Bound of Learning Rate): https://arxiv.org/abs/1902.09843
Hello Atilim,
I would like to share a few ideas for extending the method.
1) Warm Restarts: It would be great to use the method in a cyclic learning-rate fashion. I have tried resetting the learning rate externally whenever it drops below a threshold, and decaying the initial learning rate to which it is reset according to the epoch (roughly as in the sketch below). I am sure you can come up with a mathematically more robust way of doing this. https://arxiv.org/abs/1608.03983
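Roughly, the ad-hoc reset I tried looks like this. It is a sketch only: `base_lr`, `min_lr`, and the per-epoch decay are hand-tuned values of mine, and `optimizer` is whatever PyTorch optimizer the adapted learning rate lives in.

```python
def maybe_warm_restart(optimizer, epoch, base_lr=0.1, min_lr=1e-4, restart_decay=0.5):
    """If the adapted learning rate has fallen below `min_lr`, reset it to a
    restart value that itself shrinks with the epoch, in the spirit of SGDR."""
    restart_lr = max(base_lr * (restart_decay ** epoch), min_lr)
    for group in optimizer.param_groups:
        if group["lr"] < min_lr:
            group["lr"] = restart_lr
```

Called once per epoch (or every few hundred steps), this gives a crude cyclic schedule; a principled version would presumably fold the restart into the update rule itself.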
2) Sparsification: The method offers a good way of detecting convergence, which could be used to sparsify the smallest weights of the network; this has been shown to be useful in dense-sparse-dense training (https://arxiv.org/abs/1607.04381). A sketch of such sparsification is given below; for reference, this is the partially adaptive update line from the idea above:

```python
p.data.addcdiv_(-step_size, exp_avg, denom**partial)
```
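And this is a rough sketch of the magnitude-based pruning I have in mind for the sparse phase. The function name and the `fraction` hyperparameter are mine, and in dense-sparse-dense training the returned masks would be re-applied after every update until the final dense phase.

```python
import torch

@torch.no_grad()
def sparsify_smallest(model, fraction=0.3):
    """Zero out the `fraction` of weights with the smallest magnitude in each
    weight tensor once convergence has been detected; return the masks so the
    sparsity pattern can be enforced during the sparse training phase."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases and norm parameters
            continue
        k = int(fraction * param.numel())
        if k == 0:
            continue
        threshold = param.abs().flatten().kthvalue(k).values
        mask = (param.abs() > threshold).to(param.dtype)
        param.mul_(mask)                          # prune in place
        masks[name] = mask
    return masks
```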