HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

Learning-rate schedule as beta schedule #61

Closed ClashLuke closed 2 years ago

ClashLuke commented 2 years ago

Currently, our model always gives past gradients the same weight as recent ones, regardless of where it is in training. So even after a few hundred thousand steps, the optimizer would still compute the approximate update direction mainly from the last ~100 steps.

Instead, we could apply the learning-rate schedule to the gradient directly before passing it into the optimizer. This way, the optimizer would slowly deprioritize the current gradient and focus more on past ones. The learning rate would then act almost like an implicit beta schedule, which could make tuning this hyperparameter easier.
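For concreteness, a minimal optax sketch of the two orderings described above. This is an illustration under assumptions (optax, Adam, a cosine schedule), not the repository's actual optimizer code:

```python
# Minimal sketch, assuming optax; not this repo's actual optimizer setup.
import optax

schedule = optax.cosine_decay_schedule(init_value=1e-3, decay_steps=100_000)

# Conventional ordering: Adam accumulates raw gradients; the schedule only
# rescales the finished update, so the relative weighting of past vs. current
# gradients (the effective beta) never changes.
baseline = optax.chain(
    optax.scale_by_adam(b1=0.9, b2=0.99),
    optax.scale_by_schedule(lambda step: -schedule(step)),
)

# Proposed ordering: the schedule scales the gradient *before* it enters the
# first/second moment estimates, so late, small-learning-rate gradients
# contribute less to the running averages -- an implicit beta schedule.
# (How the final step size is handled afterwards is a separate choice;
# here it is simply negated for descent.)
proposed = optax.chain(
    optax.scale_by_schedule(schedule),
    optax.scale_by_adam(b1=0.9, b2=0.99),
    optax.scale(-1.0),
)
```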

ClashLuke commented 2 years ago

Worse than baseline.
Adam: 1.20 -> 1.23
Shampoo: 1.18 -> 1.23