Closed: mashu closed this 1 year ago
@chengchingwen I amended the commit so it should now reflect your suggestions.
For easier reference the paper's description (including weight decay) is:
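As best I can restate it, the update with decoupled weight decay reads as follows. The symbols $\theta$ (parameters), $g_t$ (gradient), $\eta_t$ (learning rate), $\lambda$ (weight decay) and $\beta_1, \beta_2$ are my notation and may not match the paper letter for letter; $m$ and $c$ are used as in the rest of this comment.

$$
\begin{aligned}
c_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
\theta_t &= \theta_{t-1} - \eta_t \left( \operatorname{sign}(c_t) + \lambda\, \theta_{t-1} \right) \\
m_t &= \beta_2 m_{t-1} + (1 - \beta_2)\, g_t
\end{aligned}
$$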
Later, in the appendix, they have pseudo-code without weight decay:
Note that this needs $m_{t-1}$ but never $c_{t-1}$, which is why they advertise it as needing to store fewer arrays than Adam, i.e. one extra copy of the parameters rather than two:
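To make the storage point concrete, here is a minimal sketch of a single Lion step in plain Python. It is not the code in this PR: the function name `lion_step`, the helper `sign`, the list-based layout and the default hyperparameters are all assumptions for illustration. The thing to notice is that `c` is a temporary inside the loop, so `m` is the only state carried between steps.

```python
def sign(x):
    # Returns -1, 0 or 1 for negative, zero or positive x.
    return (x > 0) - (x < 0)


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update on flat lists of floats (hypothetical helper).

    Returns the new parameters and the new momentum. Defaults are assumed
    values for illustration, not taken from this PR.
    """
    new_theta, new_m = [], []
    for th, mi, g in zip(theta, m, grad):
        # c_t is needed only for this step's sign, so it is never stored.
        c = beta1 * mi + (1 - beta1) * g
        new_theta.append(th - lr * (sign(c) + weight_decay * th))
        # m_t is the only optimiser state kept for the next step.
        new_m.append(beta2 * mi + (1 - beta2) * g)
    return new_theta, new_m


theta, m = lion_step([1.0, -2.0], [0.0, 0.0], [0.5, -0.25], lr=0.1)
```

Adam, by comparison, carries both a first-moment and a second-moment array between steps, which is where the "one not two" saving comes from.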
Looks like it's ready to go?
Implementation of the Lion optimiser from "Symbolic Discovery of Optimization Algorithms", which the paper reports to be faster than AdamW.