dfdazac / machine-learning-1

Code and notebooks on machine learning
MIT License

Is that really SGD? #1

Open romilly opened 3 years ago

romilly commented 3 years ago

I may be missing something, but it looks as if you're doing full gradient descent (i.e. using the entire batch) in the SGD class. SGD should use just a single sample selected at random.
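To make the distinction concrete, here is a rough sketch of what I mean (not this repo's code; grad_fn stands in for whatever computes the gradient):

```python
import numpy as np

def full_gd_step(w, X, y, lr, grad_fn):
    # Full (batch) gradient descent: the gradient is computed
    # over the entire training set at every step.
    return w - lr * grad_fn(w, X, y)

def sgd_step(w, X, y, lr, grad_fn):
    # "Textbook" SGD: the gradient is estimated from a single
    # example drawn at random.
    i = np.random.randint(len(X))
    return w - lr * grad_fn(w, X[i:i + 1], y[i:i + 1])
```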

dfdazac commented 3 years ago

Hi @romilly, thanks for the question. The SGD class that I implemented is mainly concerned with the update rule, implemented in the step function:

https://github.com/dfdazac/machine-learning-1/blob/0beb7c098aa8b16689075822f76bc1b4fd38dedf/neural_networks/optimizers.py#L11-L14

The SGD class is intended to encapsulate this update rule, which differs from the rules of other optimizers that, for example, add momentum or per-parameter learning rates.
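For reference, the update rule in question is plain gradient descent on the layer's parameters. A minimal sketch of such a class (assuming the layer stores its parameters in W and b and its gradients in dW and db; not a verbatim copy of optimizers.py):

```python
class SGD:
    """Plain gradient-descent update rule for a single linear layer."""

    def __init__(self, layer, lr=0.1):
        self.layer = layer
        self.lr = lr

    def step(self):
        # Apply W <- W - lr * dW and b <- b - lr * db.
        # No momentum and no per-parameter learning rates: the class
        # only encapsulates the vanilla update rule.
        self.layer.W -= self.lr * self.layer.dW
        self.layer.b -= self.lr * self.layer.db
```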

This is also a very specific optimizer, because I explicitly designed it to optimize the weights of a linear layer, using the gradients in layer.dW and layer.db. Even then, the class can be used independently of the number of samples, because the gradients dW and db always have the same shape as the parameters, and they are computed in a separate class (NNClassifier), where they can be computed from one sample, a mini-batch of samples, or even the full training set.
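In other words, the optimizer never sees how many samples produced the gradients. A quick shape check illustrating this (a toy example with made-up dimensions, independent of the actual NNClassifier code):

```python
import numpy as np

# Hypothetical linear layer with 3 inputs and 2 outputs.
in_features, out_features = 3, 2
W = np.zeros((in_features, out_features))

for batch_size in (1, 16, 1000):  # single sample, mini-batch, "full" training set
    X = np.random.randn(batch_size, in_features)
    upstream = np.random.randn(batch_size, out_features)
    # Gradient of the loss w.r.t. W for a linear layer, averaged over the batch.
    dW = X.T @ upstream / batch_size
    assert dW.shape == W.shape  # same shape regardless of batch size
```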