facebookresearch / dadaptation

D-Adaptation for SGD, Adam and AdaGrad
MIT License

[Suggestions] D-Adaptation for Adan (Adaptive Nesterov Momentum Algorithm) #14

qwopqwop200 closed this issue 1 year ago

qwopqwop200 commented 1 year ago

Hi, thanks for your excellent work. Adan is currently a SOTA optimizer, and I think applying D-Adaptation to Adan would be very practical, so I implemented it: D-Adaptation Adan and D-Adaptation Adan IP. However, I'm not sure whether this implementation is correct. It works very well, at least on CIFAR-10, where it also outperforms D-Adaptation Adam.

[image: 217195448-7202126f-6682-4fb0-9c99-432f534a9c9c]

Pseudocode:

[image: 217242205-efcb5d6e-9123-4ce4-bf31-3ffcefb002b2]

where λ is the weight decay constant.
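
For readers without access to the image, here is a rough sketch of one plain Adan step, to show where the weight decay constant λ enters. It is not the D-Adaptation variant proposed in this issue and not a transcription of the pseudocode image. It follows the update rule as I understand it from the Adan paper, without bias correction. The function name `adan_step`, the list-of-floats representation, and the default hyperparameter values are assumptions made for illustration. A D-Adaptation variant would replace the fixed `lr` with an automatically estimated step size, which is not shown here.

```python
import math


def adan_step(theta, g, g_prev, m, v, n,
              lr=1e-3, beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, weight_decay=0.02):
    """One plain Adan update on lists of floats (hypothetical helper).

    theta: parameters, g: current gradient, g_prev: previous gradient,
    m, v, n: running averages of the gradient, the gradient difference,
    and the squared Nesterov-style gradient estimate.
    """
    new_theta, new_m, new_v, new_n = [], [], [], []
    for t, gk, gp, mk, vk, nk in zip(theta, g, g_prev, m, v, n):
        diff = gk - gp
        mk = (1 - beta1) * mk + beta1 * gk        # first moment
        vk = (1 - beta2) * vk + beta2 * diff      # gradient-difference moment
        u = gk + (1 - beta2) * diff               # Nesterov-style gradient estimate
        nk = (1 - beta3) * nk + beta3 * u * u     # second moment of that estimate
        step = lr / (math.sqrt(nk) + eps)         # per-coordinate step size
        # Weight decay (lambda) is applied as a multiplicative shrink of the
        # updated parameter rather than being added to the gradient.
        t = (t - step * (mk + (1 - beta2) * vk)) / (1 + weight_decay * lr)
        new_theta.append(t)
        new_m.append(mk)
        new_v.append(vk)
        new_n.append(nk)
    return new_theta, new_m, new_v, new_n
```

With a zero gradient, the step reduces to pure shrinkage: each parameter is divided by `1 + weight_decay * lr`.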

Please let me know if there are any mistakes.

adefazio commented 1 year ago

Hi, I will run some experiments on this and see how it goes. If you open a pull request, I can merge it, provided the experiments show promise.