facebookresearch / dadaptation

D-Adaptation for SGD, Adam and AdaGrad
MIT License

Add Lion with d-adaptation #20

Closed nviwch closed 1 year ago

nviwch commented 1 year ago

I am very curious what would happen if the Lion optimizer were combined with D-Adaptation. As the paper and experiments point out, Lion is quite sensitive to the learning rate. If you choose the right learning rate, it can surpass AdamW.

https://github.com/lucidrains/lion-pytorch

JosephLYH commented 1 year ago

Lion seems promising given the results Google released, but there seems to be disagreement on how to tune the hyperparameters. Some tests I've seen suggest settings different from what the paper's authors recommend. I'd like to see how the optimizer does when it is able to adapt.

adefazio commented 1 year ago

I am looking at developing D-Adaptation for Lion soon.

I've been experimenting with Lion a lot. In my experiments it doesn't seem to work well for the typical benchmark problems I normally test on, which I believe is due to the batch size being somewhat small. signSGD family methods like Lion tend to require very large batch sizes to work well.

JosephLYH commented 1 year ago

I have tried using batch sizes 3-4 times larger than usual, together with a roughly 3 times lower learning rate for Lion, and it works noticeably better than my previous runs.

adefazio commented 1 year ago

I've added an initial implementation on the lion branch: https://github.com/facebookresearch/dadaptation/blob/lion/dadaptation/dadapt_lion.py

Let me know if it works for you. It needs more testing before I merge it.
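A minimal usage sketch, assuming the class in that file is named `DAdaptLion` and that its constructor follows the other D-Adaptation optimizers in this repo (the `lr` argument acts as a multiplier on the adapted step size, so it is normally left at 1.0); the model and data below are placeholders:

```python
import torch
from dadaptation.dadapt_lion import DAdaptLion  # class name assumed from the branch file

model = torch.nn.Linear(128, 10)  # placeholder model

# With D-Adaptation, lr is a multiplier on the estimated step size;
# the usual recommendation is to leave it at 1.0.
optimizer = DAdaptLion(model.parameters(), lr=1.0)

inputs, targets = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
```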

nviwch commented 1 year ago

I have tried D-Adapt Lion. The D-Adaptation learning rate keeps increasing, and the loss became NaN after some time. I used cosine learning rate decay with warmup. What scheduler should I use for stable training?

adefazio commented 1 year ago

It's important to use a schedule with learning rate warmup; that might help.
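A sketch of how such a schedule could be attached, assuming the D-Adaptation base `lr` is kept at 1.0 and the scheduler scales it multiplicatively; the step counts and the `DAdaptLion` import are placeholders/assumptions, and `torch.optim.lr_scheduler.LambdaLR` is standard PyTorch:

```python
import math
import torch
from dadaptation.dadapt_lion import DAdaptLion  # class name assumed from the branch file

model = torch.nn.Linear(128, 10)  # placeholder model
optimizer = DAdaptLion(model.parameters(), lr=1.0)

warmup_steps, total_steps = 500, 10_000  # placeholder step counts

def lr_lambda(step):
    # Linear warmup from ~0 to 1, then cosine decay back toward 0.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop: call optimizer.step(), then scheduler.step().
```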

sdbds commented 1 year ago

> I have tried D-Adapt Lion. The D-Adaptation learning rate keeps increasing, and the loss became NaN after some time. I used cosine learning rate decay with warmup. What scheduler should I use for stable training?

I think using gradient accumulation to increase the effective batch size may give better results. Generally, a batch size of 64 or above is recommended for Lion.
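A minimal gradient-accumulation sketch of what is suggested above; `loader`, `model`, and `optimizer` are placeholders, and the accumulation factor is an assumption, not a recommendation from this thread:

```python
import torch

accumulation_steps = 4  # effective batch size = per-step batch size * 4 (placeholder value)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    # Scale the loss so the accumulated gradients average over the effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```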