Closed · nviwch closed this issue 1 year ago
Lion seems promising given the results Google released, but there seems to be disagreement on how to tune its hyperparameters. Some tests I've seen suggest settings different from what the paper's authors recommend. I'd like to see how the optimizer does when it is able to adapt its learning rate.
I am looking at developing D-Adaptation for Lion soon.
I've been experimenting with Lion a lot. In my experiments it doesn't seem to work well on the typical benchmark problems I normally test on, which I believe is because the batch size is somewhat small; signSGD-family methods like Lion tend to require very large batch sizes to work well.
I have tried using batch sizes 3-4 times larger than usual, together with a learning rate about 3 times lower for Lion, and it seems to work noticeably better than my previous runs.
I've added an initial implementation on branch lion: https://github.com/facebookresearch/dadaptation/blob/lion/dadaptation/dadapt_lion.py Let me know if it works for you. It needs more testing before I merge it.
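For anyone who wants to try the branch, here is a minimal sketch of wiring it into a standard PyTorch training step. This assumes the optimizer class is exposed as `DAdaptLion` in `dadapt_lion.py` and accepts the usual `lr` and `weight_decay` arguments, with `lr` left at 1.0 as with the other D-Adaptation optimizers; check the branch for the actual signature.

```python
# Sketch: trying the lion-branch optimizer in a plain training loop.
# DAdaptLion and its arguments are assumptions based on the other
# dadaptation optimizers; lr stays at 1.0 so D-Adaptation sets the scale.
import torch
from dadaptation.dadapt_lion import DAdaptLion

model = torch.nn.Linear(128, 10)
optimizer = DAdaptLion(model.parameters(), lr=1.0, weight_decay=0.01)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```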
I have tried D-Adapt Lion, but the d-adapted learning rate keeps increasing and the loss became NaN after some time. I used cosine learning rate decay with warmup. What scheduler should I use for stable training?
It's important to use a schedule with learning rate warmup; that might help.
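A sketch of a warmup-then-cosine schedule using stock PyTorch schedulers is below. It assumes the optimizer's `lr` is left at 1.0 so that D-Adaptation controls the magnitude and the scheduler only applies a relative multiplier; the warmup and total step counts are placeholders.

```python
# Warmup + cosine decay with standard PyTorch schedulers (illustrative values).
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)  # stand-in; use the D-Adapt Lion optimizer here

warmup_steps = 1_000
total_steps = 100_000

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps),
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
# Call scheduler.step() once after each optimizer.step().
```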
I think using gradient accumulation steps to increase the effective batch size may give better results, as in the sketch below. Generally, a batch size of 64 or above is recommended for Lion.
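A minimal sketch of gradient accumulation is shown here; `dataloader`, `model`, and `optimizer` are hypothetical placeholders, and the effective batch size is the per-step batch times `accum_steps`.

```python
# Gradient accumulation: e.g. per-step batch of 16 * accum_steps of 4 -> effective batch of 64.
import torch

accum_steps = 4

for step, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```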
I am very curious what happens when the Lion optimizer is combined with D-Adaptation. As the paper and experiments point out, Lion is quite sensitive to the learning rate. If you choose the right learning rate, it can surpass AdamW.
https://github.com/lucidrains/lion-pytorch
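For reference, the lucidrains implementation linked above is used roughly as follows, based on its README (typical guidance there is a learning rate several times smaller than AdamW's and a larger weight decay); check the repo for the current arguments.

```python
# Rough usage of lion-pytorch, based on the repository's README.
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(128, 10)
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

loss = model(torch.randn(8, 128)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```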