There are quite a few post-update transforms or modifiers that are generally useful for stabilizing the gradient descent process. One example is Stochastic Weight Averaging (SWA) and the broader family of weight-averaging methods; another is Lookahead.
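To make the two families concrete, here is a minimal sketch of their update rules, assuming a PyTorch-style parameter representation. The helper names `swa_update` and `lookahead_update` are illustrative only, not names from an existing library:

```python
import torch

def swa_update(swa_params, params, n_averaged):
    # Running average of weights: w_swa <- (w_swa * n + w) / (n + 1)
    for p_swa, p in zip(swa_params, params):
        p_swa.data.mul_(n_averaged).add_(p.data).div_(n_averaged + 1)

def lookahead_update(slow_params, fast_params, alpha=0.5):
    # Slow weights step toward the fast weights: phi <- phi + alpha * (theta - phi),
    # then the fast weights are reset to the new slow weights.
    for p_slow, p_fast in zip(slow_params, fast_params):
        p_slow.data.add_(p_fast.data - p_slow.data, alpha=alpha)
        p_fast.data.copy_(p_slow.data)
```

Both operate purely on the parameters after the base optimizer has taken its step, which is what makes them candidates for a shared post-update hook.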
These methods need a general action site to hook into, so that it is easy to extend when we add a new method, and easy to apply them to newer optimizers that don't yet support them. A rough sketch of what that could look like follows.
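One possible shape for that action site is a thin wrapper that runs the base optimizer's step and then calls every registered transform. This is a sketch under the assumption of a PyTorch-style `Optimizer` interface; the class name `PostUpdateOptimizer` and the transform callback signature are hypothetical, not a committed design:

```python
import torch

class PostUpdateOptimizer:
    """Wraps a base optimizer and applies registered post-update
    transforms to the freshly updated parameters after each step."""

    def __init__(self, base_optimizer, transforms=None):
        self.base = base_optimizer
        # Each transform is a callable: transform(param_groups, step_count)
        self.transforms = list(transforms or [])
        self._step_count = 0

    def register(self, transform):
        self.transforms.append(transform)

    def step(self, closure=None):
        loss = self.base.step(closure)
        self._step_count += 1
        for transform in self.transforms:
            transform(self.base.param_groups, self._step_count)
        return loss

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)


# Hypothetical usage: wrap any existing optimizer without modifying it.
# model = torch.nn.Linear(4, 2)
# opt = PostUpdateOptimizer(torch.optim.Adam(model.parameters(), lr=1e-3),
#                           transforms=[lambda groups, step: None])
```

The point of the wrapper is that new optimizers get these behaviours for free: nothing in the base optimizer has to know that averaging or lookahead exists.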
In adding these updates, I don't expect us to produce the most efficient implementation possible in terms of memory or speed. But making them usable, even if slower and heavier, is still better than spending hours integrating them into each new optimizer.
I want to have this out and ready to use as quickly as possible, and leave optimizing our implementation as future work.