Is your feature request related to a problem? Please describe.
The current implementation of optimizers does not generalize to neural networks.
Reason: they are hard-coded to work with a single parameter at a time, and so are most of the loss functions.
Describe the solution you'd like
We should implement the pre-existing optimizers from scratch (not really from scratch, though) in the optim folder by subclassing the base Optimizer, with their sole purpose being to update the parameters passed to them. Most (read: almost all) optimizers are already implemented inside optimizers.py, so they can be used as a reference for understanding how a core mathematical idea translates to code.
But what about gradients, then? autograd will handle them.
Describe alternatives you've considered
We could modify the loss functions to manually return the gradients for all parameters to the optimizer, but this would require further abstraction of the iterate(...) method of the optimizers, and modifying the loss functions in such a way would be so time- and resource-consuming that it might not be worth the effort, because we already have automatic differentiation in place to compute those derivatives.
Approach to be followed
Read the available implementations in the optim folder. Here's a description of what's been done in the case of SGD:
The __init__(...) function accepts a Python iterable containing the parameters to be updated, plus the learning rate as lr.
The step(...) method iterates over all the parameters that were passed during the instantiation of the optimizer and performs the update according to the rule P = P - learning_rate * dL/dP, where L is the final loss and P is a parameter.
Here's a list of (most of) the optimizing algorithms used for neural networks: https://www.kdnuggets.com/2019/06/gradient-descent-algorithms-cheat-sheet.html. You can read the mathematical equations listed there, and if a better intuitive understanding is desired? YouTube!
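The SGD description above can be sketched in plain Python/NumPy. Note this is a minimal, hypothetical stand-in, not the library's actual code: `Param` here models only the `.data`/`.grad` attributes an optimizer touches, and `grad` is a raw array (in the real library it is itself a Tensor).

```python
import numpy as np

# Hypothetical stand-in for the library's parameter Tensor: only the pieces
# the optimizer touches (.data and .grad) are modeled here.
class Param:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None  # would be filled in by autograd after backprop

class Optimizer:
    """Base class: stores the iterable of parameters and the learning rate."""
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self):
        raise NotImplementedError

class SGD(Optimizer):
    def step(self):
        # P = P - learning_rate * dL/dP for every parameter
        for p in self.params:
            p.data = p.data - self.lr * p.grad

p = Param([1.0, 2.0])
p.grad = np.array([0.5, 0.5])  # pretend autograd produced this
SGD([p], lr=0.1).step()
print(p.data)  # -> [0.95 1.95]
```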
By now, it must be clear that not every optimizer is as simple as SGD. So how should they actually be implemented?
The __init__(...) function:
Must accept the parameters and a learning rate, obviously.
All hyper-parameters must be initialized/taken into account here. For example, in the case of SGDWithMomentum we require a hyper-parameter beta, which should be accepted in __init__(...) and stored on the class instance as an attribute because we need it later to perform our update; similarly, for Adam we will need beta1, beta2 and epsilon.
What about weight decay and such? Sure, but it is not the focus here; once all the optimizers are successfully ported, we can open a separate issue for it.
The "Extra Super Important Variables" which help perform the update should also be initialized here. For example, for SGDWithMomentum we need to initialize a V to zero for every parameter, and then update and use the V corresponding to each parameter to update it; similarly, for Adam we need to initialize S_t and V_t in the same fashion to perform the desired update on the parameters.
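The points above (hyper-parameter storage plus per-parameter state) can be sketched like this for a hypothetical SGDWithMomentum. The `Param` stand-in and the `V = beta*V + (1-beta)*grad` formulation are assumptions for illustration, not the library's actual code:

```python
import numpy as np

# Hypothetical Param stand-in: just .data and .grad, no autograd.
class Param:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

class SGDWithMomentum:
    def __init__(self, params, lr, beta=0.9):
        self.params = list(params)
        self.lr = lr
        self.beta = beta  # hyper-parameter stored as an attribute for step()
        # "Extra Super Important Variables": one velocity V per parameter,
        # initialized to zero here in __init__ (Adam would likewise create
        # zero-initialized S_t and V_t per parameter).
        self.V = [np.zeros_like(p.data) for p in self.params]

    def step(self):
        for p, v in zip(self.params, self.V):
            # one common formulation: V = beta * V + (1 - beta) * grad
            v *= self.beta
            v += (1 - self.beta) * p.grad
            p.data = p.data - self.lr * v
```

The velocities are updated in place so the state stored in `self.V` persists across calls to `step()`.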
The step(...) function:
Performs the gradient update. Or, to put it simply, this is the arena of the mathematical equations used to update the parameters. The gradients with respect to the final loss are stored in param.grad.
The actual update should be performed on param.data instead of param, using param.grad.data and not param.grad. Why so? The actual data is stored in the data attribute of a Tensor. When we perform operations on Tensors, they get added to the computation graph; to put it simply, these operations are recorded to be used during back-propagation. Since we are just updating the param, we don't want the update to be used during back-propagation, so we perform the operation directly on param.data and param.grad.data (param and param.grad are both Tensors).
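To illustrate why the update goes through `.data`, here is a toy Tensor (a hypothetical stand-in, not the library's class) whose `graph_ops` counter mimics nodes being added to a computation graph:

```python
import numpy as np

# Toy stand-in Tensor: graph_ops counts Tensor-level operations, mimicking
# how a real autograd would record each op in the computation graph.
class Tensor:
    graph_ops = 0

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

    def __sub__(self, other):
        Tensor.graph_ops += 1  # a real autograd would record this op
        return Tensor(self.data - other.data)

p = Tensor([1.0])
p.grad = Tensor([0.5])

# "p = p - lr_times_grad" would go through Tensor.__sub__ and be recorded.
# Updating the raw arrays instead leaves the graph untouched:
p.data = p.data - 0.1 * p.grad.data

print(Tensor.graph_ops)  # -> 0
print(p.data)            # -> [0.95]
```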
NOTE: The cheat sheet image at the bottom of the kdnuggets page linked above lists the equations in reverse order.
A NOTE ABOUT NESTEROV ACCELERATED GRADIENT algorithm:
Most libraries implement this by storing the look-ahead weights so that the update looks more like plain gradient descent; we can follow this route as well. If this didn't make sense: "Don't worry about it!"
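A minimal sketch of that look-ahead-weights route, using the same hypothetical `Param` stand-in. This follows the common reformulation (used, e.g., by PyTorch's SGD with nesterov=True) where the stored weights are the look-ahead ones, so the update reads like an ordinary gradient-descent step:

```python
import numpy as np

class Param:  # hypothetical stand-in: just .data and .grad
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

class NesterovSGD:
    def __init__(self, params, lr, beta=0.9):
        self.params = list(params)
        self.lr = lr
        self.beta = beta
        self.V = [np.zeros_like(p.data) for p in self.params]

    def step(self):
        for p, v in zip(self.params, self.V):
            g = p.grad
            v[:] = self.beta * v + g  # velocity update
            # the stored weights are the look-ahead ones, so the update
            # below looks just like a plain gradient-descent step:
            p.data = p.data - self.lr * (g + self.beta * v)
```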
Additional context
Neural nets depend entirely on optimizers for learning; having various optimizers available at hand would be a great feat for the library.