gbaydin / hypergradient-descent

Hypergradient descent

Ideas for Extension #6

Open akaniklaus opened 5 years ago

akaniklaus commented 5 years ago

Hello Atilim,

I would like to share a few ideas for extending the method.

1) Warm Restarts: It would be great to use the method in a cyclic learning-rate fashion. I have tried resetting the learning rate externally whenever it falls below a threshold, and decaying the initial learning rate to which it is reset according to the epoch. But I am sure that you can come up with a mathematically more robust way of doing this. https://arxiv.org/abs/1608.03983
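
A minimal sketch of that external reset heuristic, assuming a PyTorch-style optimizer whose hypergradient-adapted learning rate is stored in optimizer.param_groups[...]['lr']; the function name, thresholds, and decay factor below are illustrative, not part of this repository:

def maybe_warm_restart(optimizer, epoch, lr_floor=1e-5, lr_init=1e-2, restart_decay=0.9):
    # Reset the learning rate whenever the hypergradient updates have driven it
    # below lr_floor; the value it is reset to shrinks with the epoch, mimicking
    # the shrinking restarts of SGDR (https://arxiv.org/abs/1608.03983).
    for group in optimizer.param_groups:
        if group['lr'] < lr_floor:
            group['lr'] = lr_init * (restart_decay ** epoch)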

2) Sparsification: The method offers a good way of detecting convergence, at which point the smallest weights of the network can be sparsified, as this has been shown to be useful in dense-sparse-dense training (https://arxiv.org/abs/1607.04381). Below is code to perform such sparsification:

import math
import torch
import torch.nn as nn

def sparsify(module, sparsity=0.25):
    # Zero out the smallest-magnitude fraction of weights in every Conv1d/Linear layer.
    for m in module.modules():
        if isinstance(m, (nn.Conv1d, nn.Linear)):
            wv = m.weight.data.view(-1)
            k = int(math.floor(sparsity * wv.numel()))
            # Indices of the k weights with the smallest absolute values
            smallest_idx = wv.abs().topk(k, largest=False)[1]
            # Boolean mask on the same device as the weights (avoids a hard-coded .cuda())
            mask = torch.zeros_like(m.weight, dtype=torch.bool)
            mask.view(-1)[smallest_idx] = True
            m.weight.data.masked_fill_(mask, 0.)
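
As a hypothetical illustration (the threshold and the optimizer/model names are assumptions, not part of this repository), the sparsification could be triggered once the hypergradient-adapted learning rate has shrunk below some value, treating that as a sign of convergence:

if optimizer.param_groups[0]['lr'] < 1e-4:  # assumed convergence threshold; tune per task
    sparsify(model, sparsity=0.25)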

3) Partially Adaptive Momentum Estimation: Given that there is research supporting the idea of switching from Adam to SGD at later epochs for better generalization, I have implemented this by decaying the partial parameter from its initial value to a hypertuned lower value (between 0.0 and 1.0). I am curious whether the method proposed here could also provide a better, dynamic way of achieving this.

p.data.addcdiv_(exp_avg, denom ** partial, value=-step_size)  # Adam-style step with the denominator raised to the partial power
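
As a rough illustration of the decay schedule (the starting value of 1.0, the linear form, and the defaults below are assumptions for the sketch, not part of this repository):

def partial_schedule(epoch, decay_epochs=50, partial_min=0.125):
    # Linearly decay the partially adaptive exponent from 1.0 (plain Adam in the
    # update above) toward partial_min; after decay_epochs it stays at the floor.
    return partial_min + (1.0 - partial_min) * max(0.0, 1.0 - epoch / decay_epochs)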

kayuksel commented 5 years ago

I would also investigate how it could help enable "Super Convergence": https://arxiv.org/abs/1708.07120

kayuksel commented 5 years ago

Also relevant to the Partially Adaptive Momentum Estimation point above: Adaptive Gradient Methods with Dynamic Bound of Learning Rate, https://arxiv.org/abs/1902.09843