FluxML / Optimisers.jl

Optimisers.jl defines many standard optimisers and utilities for learning loops.
https://fluxml.ai/Optimisers.jl
MIT License

Implement Lion, up to 5x faster than Adam, and more accurate #156

Closed PallHaraldsson closed 1 year ago

PallHaraldsson commented 1 year ago

Motivation and description

https://arxiv.org/abs/2302.06675

Lion (EvoLved Sign Momentum) is more memory-efficient than Adam, as it only keeps track of the momentum. Unlike adaptive optimizers, its update has the same magnitude for each parameter, calculated through the sign operation. The paper compares Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT.

It's 11 lines of pseudo-code (shorter than AdamW)

Possible Implementation

No response
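
For reference, the paper's pseudocode boils down to roughly the following. This is a minimal Julia sketch of the update rule, not the Optimisers.jl implementation; the name `lion_step!` and the default hyperparameters are illustrative only.

```julia
# Minimal sketch of one Lion step, following the paper's pseudocode.
# θ: parameters, g: gradient, m: momentum (the only state Lion keeps).
function lion_step!(θ, g, m; η = 1e-4, β1 = 0.9, β2 = 0.99, λ = 0.0)
    c = @. β1 * m + (1 - β1) * g      # interpolate momentum and current gradient
    @. θ -= η * (sign(c) + λ * θ)     # sign-based update plus decoupled weight decay
    @. m = β2 * m + (1 - β2) * g      # update the momentum in place
    return θ, m
end
```

Note that every coordinate moves by exactly ±η (plus weight decay), which is why the paper recommends a smaller learning rate than one would use with Adam.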

ToucheSir commented 1 year ago

Note that subsequent research has shown at best marginal improvements over Adam(W) under more rigorous experimental design. Nevertheless, this should be a straightforward addition if anyone is interested in getting their feet wet with a PR.

chengchingwen commented 1 year ago

Isn't it already done in #129?

ToucheSir commented 1 year ago

You're right, I completely forgot about that. Thanks Peter!

PallHaraldsson commented 1 year ago

It seems Lion is neither documented nor implemented in Flux.jl, nor here?

https://fluxml.ai/Flux.jl/stable/training/optimisers/

https://github.com/FluxML/Flux.jl/blob/134882831277844cfab81f2e6ef393634b4215ec/src/optimise/Optimise.jl#L7

I recall looking for it in the code and not finding it; then, searching for Adam, I found "AdamW, RAdam", so I thought I was in the right place ("if not all are listed there, then more optimizers, such as Lion, are implemented in .."). Did the optimizers originally live in Flux.jl and then get moved out to a new package? Or are they re-exported in Flux.jl for compatibility (I can understand that)?

In general, do you think you have the best optimizers implemented (somewhere)?

[I know where the activation functions are; it seems squareplus is not implemented (which seems like a good softplus alternative). I could implement it, or add it to my NNlib.jl issue. I also think FlashAttention is missing, and its improved version 2.]
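
For reference, squareplus itself is essentially a one-liner. A sketch of the formula from the squareplus proposal (not an existing NNlib function; `b` is its smoothness hyperparameter):

```julia
# Sketch of squareplus: behaves like softplus/ReLU but uses only algebraic
# operations; b = 4 is the commonly quoted default from the proposal.
squareplus(x, b = 4) = (x + sqrt(x^2 + b)) / 2
```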

chengchingwen commented 1 year ago

Lion is implemented here (in Optimisers.jl). I believe src/optimise/Optimise.jl in Flux.jl is outdated and should be ignored.
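
For example, a minimal sketch of using it through the setup/update API (the model and gradient here are stand-ins, and the exact `Lion` constructor arguments and defaults are best checked in its docstring):

```julia
using Optimisers

# Any Functors-compatible structure of arrays works as a "model"; this NamedTuple
# is just a stand-in, as is the hand-written "gradient" below.
model = (W = randn(Float32, 3, 4), b = zeros(Float32, 3))
state = Optimisers.setup(Optimisers.Lion(1f-4), model)    # learning rate first

grad = (W = randn(Float32, 3, 4), b = randn(Float32, 3))  # normally from Zygote.gradient
state, model = Optimisers.update(state, model, grad)
```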

mcabbott commented 1 year ago

Or are they re-exported in Flux.jl for compatibility

At present this is a little complicated. Flux still exports its own (src/optimise/Optimise.jl) optimisers, but has methods to auto-translate them to their Optimisers.jl equivalents. The hope is to delete all of that soon -- perhaps https://github.com/FluxML/Flux.jl/issues/1986 is the relevant issue.

Having Flux re-export any newly added rules (for which it has no old equivalents, like Lion) would be fine. They could be temporarily included in the docs. Or perhaps simpler, some note to look at Optimisers.jl for more could be added somewhere.
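
For what it's worth, even without a re-export, rules that exist only in Optimisers.jl can already be passed to Flux's explicit-state training API. A sketch, assuming a recent Flux that provides `Flux.setup`:

```julia
using Flux, Optimisers

model = Flux.Dense(4 => 2)                        # any Flux model
opt_state = Flux.setup(Optimisers.Lion(), model)  # Optimisers.jl rules work directly
# ... then the usual Flux.withgradient / Flux.update!(opt_state, model, grad) loop
```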

ToucheSir commented 1 year ago

There is indeed such a note in https://fluxml.ai/Flux.jl/stable/training/optimisers/. We'd want to make the preceding paragraph more strongly worded, however, as I think the replacement is basically done and no longer "gradual".

Now, one thing I did notice is that Lion is not currently included in the Optimisers.jl docs build. That should be a simple enough fix.