gbaydin / hypergradient-descent

Hypergradient descent

AdamW #5

Closed akaniklaus closed 5 years ago

akaniklaus commented 5 years ago

Thank you very much for this beautiful work. Since Adam has a generalization issue when used with L2 regularization, it would be great if you could also provide an implementation of HD for AdamW. There is an implementation here: https://github.com/mpyrozhok/adamwr

I have also recently been experimenting with Padam and QHAdam, but couldn't obtain any improvement from them on the RL problem I am working on. Do you have any thoughts about them?

The key change there is the decoupled weight-decay step:

                # Decoupled (AdamW-style) weight decay: the decay term is computed
                # directly from the weights and subtracted after the Adam step,
                # instead of being folded into the gradient as with L2 regularization.
                if group['weight_decay'] != 0:
                    decayed_weights = torch.mul(p.data, group['weight_decay'])
                    p.data.addcdiv_(-step_size, exp_avg, denom)
                    p.data.sub_(decayed_weights)
                else:
                    p.data.addcdiv_(-step_size, exp_avg, denom)
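
For completeness, the hypergradient part of Adam-HD adapts the learning rate online from the dot product of the current gradient and the previous Adam update direction, so a combined optimizer would run something like the following before the moment updates (a sketch against the usual Adam state; the variable names are illustrative and bias correction is omitted, so this is not the repo's exact code):

                # Hypergradient update of the learning rate, run before
                # exp_avg / exp_avg_sq are refreshed with the current gradient,
                # so the dot product uses the previous step's update direction.
                if state['step'] > 1:
                    prev_update = exp_avg / exp_avg_sq.sqrt().add(group['eps'])
                    h = torch.dot(grad.view(-1), prev_update.view(-1))
                    group['lr'] += group['hypergrad_lr'] * h.item()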
akaniklaus commented 5 years ago

I have done the combination and can share it upon request.
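
Roughly, a combined step could look like the following minimal sketch on a single parameter tensor (a plain function rather than an optimizer class; hypergrad_lr and the other names are illustrative assumptions, bias correction is omitted, and this is not necessarily the combination referred to above):

    import torch

    def adamw_hd_step(p, grad, state, lr, hypergrad_lr=1e-8,
                      betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        # One illustrative AdamW + hypergradient-descent step on a plain
        # tensor p with gradient grad (e.g. p.data and p.grad.data). The
        # state dict carries the Adam moments between calls; bias correction
        # is omitted to keep the sketch short.
        beta1, beta2 = betas
        exp_avg = state.setdefault('exp_avg', torch.zeros_like(p))
        exp_avg_sq = state.setdefault('exp_avg_sq', torch.zeros_like(p))
        state['step'] = state.get('step', 0) + 1

        # Hypergradient update of the learning rate, using the update
        # direction from the previous step (moments not yet refreshed).
        if state['step'] > 1:
            prev_update = exp_avg / exp_avg_sq.sqrt().add(eps)
            h = torch.dot(grad.view(-1), prev_update.view(-1)).item()
            lr = lr + hypergrad_lr * h

        # Standard Adam moment updates.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        denom = exp_avg_sq.sqrt().add(eps)

        # Decoupled (AdamW-style) weight decay, computed from the weights
        # before the Adam step, as in the snippet above.
        decayed = p * weight_decay if weight_decay != 0 else None

        p.addcdiv_(exp_avg, denom, value=-lr)
        if decayed is not None:
            p.sub_(decayed)

        return lr  # adapted learning rate, fed back in on the next call

Returning the adapted learning rate and passing it back in on the next call is what makes the rate adjust online, which is the point of the hypergradient scheme.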