Hi, thanks for your excellent work.
Adan is currently a SOTA optimizer, and I think applying D-Adaptation to it would be very practical, so I implemented it: D-Adaptation Adan, D-Adaptation Adan IP
However, I'm not sure whether this implementation is correct. It works very well, at least on CIFAR-10. It also performs better than D-Adaptation Adam.
Hi, I will run some experiments on this and see how it goes. If you open a pull request, I will be able to merge it once the experiments show promise.
pseudo code
where λ is the weight decay constant.
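For reference, here is a minimal sketch of where λ enters a plain Adan step, as I understand the update from the Adan paper. It is not the D-Adaptation variant discussed in this issue. The function name `adan_step`, the scalar parameter, and the default hyperparameters are assumptions for illustration only, and bias correction and the restart condition are left out.

```python
import math

def adan_step(theta, g, g_prev, m, v, n,
              lr=1e-3, beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, lam=0.0):
    """One plain Adan step on a scalar parameter (hypothetical helper).

    Assumptions: paper-style betas (small values), no bias correction,
    no D-Adaptation; `lam` is the weight decay constant λ.
    """
    diff = g - g_prev                               # gradient difference
    m = (1 - beta1) * m + beta1 * g                 # first-moment estimate
    v = (1 - beta2) * v + beta2 * diff              # gradient-difference estimate
    u = g + (1 - beta2) * diff                      # extrapolated gradient
    n = (1 - beta3) * n + beta3 * u * u             # second-moment estimate
    step = lr / (math.sqrt(n) + eps)                # per-parameter step size
    # Decoupled weight decay: divide by (1 + λ·lr) after the gradient step.
    theta = (theta - step * (m + (1 - beta2) * v)) / (1 + lam * lr)
    return theta, m, v, n
```

With zero gradients and zero state, the step reduces to pure shrinkage, `theta / (1 + lam * lr)`, which is a quick way to check how the weight decay is applied.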
Please let me know if there are any mistakes.