Currently, our optimizer always gives past gradients the same weight relative to the current gradient, regardless of where the model is in training. So even after a few hundred thousand steps, the optimizer would still look at roughly the last 100 steps to compute the approximate update direction.
Instead, we could apply the learning rate schedule to the gradient directly, before passing it into the optimizer. This way, as the learning rate decays, the optimizer would slowly deprioritize the current gradient and focus more on the past ones. The learning rate schedule would then act almost like an implicit beta schedule, which could make this hyperparameter easier to tune.