bkj opened this issue 7 years ago
Good question! I think this just comes down to parameterization, by which I mean we can produce the same iterates using any of these versions by setting the (hyper)parameters differently.
For example, take the autograd version:
velocity = momentum * velocity - (1.0 - momentum) * g
x = x + learning_rate * velocity
We could move the learning_rate scale into the velocity sequence, to yield
lr_times_velocity = momentum * lr_times_velocity - (learning_rate * (1.0 - momentum)) * g
x = x + lr_times_velocity
Now by choosing learning_rate and momentum appropriately we can simulate the CS231/NAG version (because the map (lr, m) -> (lr * (1 - m), m) is invertible for momentum < 1).
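To make that conversion concrete, here's a small hedged sketch -- the helper names are mine, not anything in autograd:

def autograd_params_from_cs231(learning_rate, momentum):
    # Hypothetical helper: given CS231-style (learning_rate, momentum),
    # return autograd-style hyperparameters that produce the same iterates.
    # This inverts the map (lr, m) -> (lr * (1 - m), m); needs momentum < 1.
    return learning_rate / (1.0 - momentum), momentum

def cs231_params_from_autograd(learning_rate, momentum):
    # Hypothetical helper: the forward map (lr, m) -> (lr * (1 - m), m).
    return learning_rate * (1.0 - momentum), momentum

For example, autograd_params_from_cs231(0.5, 0.9) gives (5.0, 0.9) (up to floating point), which is exactly the pair used in the check below.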
You can do something similar with the PyTorch version.
EDIT: fixed math bugs! That's the problem with doing math at the end of the day...
Here's a quick check that my debug math worked out:
import numpy as np

gs = np.random.randn(10)  # shared sequence of fake "gradients"

def cs231_sgd(x0, v0, learning_rate, momentum):
    # CS231n/NAG-style update: learning_rate scales the gradient inside the velocity.
    x = x0
    velocity = v0
    for g in gs:
        velocity = momentum * velocity - learning_rate * g
        x = x + velocity
        print(x)
    print()

def autograd_sgd(x0, v0, learning_rate, momentum):
    # autograd-style update: velocity is a (1 - momentum)-damped average,
    # and learning_rate scales it at the parameter step.
    x = x0
    velocity = v0
    for g in gs:
        velocity = momentum * velocity - (1.0 - momentum) * g
        x = x + learning_rate * velocity
        print(x)
    print()

cs231_sgd(0., 0., 0.5, 0.9)
autograd_sgd(0., 0., 0.5 / (1. - 0.9), 0.9)
EDIT: fixed calls!
Ah great, thanks for the answer.
Looks like you have the calls flipped -- should be
cs231_sgd(0., 0., 0.5, 0.9)
autograd_sgd(0., 0., 0.5 / (1. - 0.9), 0.9)
to yield the same answers.
As a general point, I'd contend that a parameterization where LRs are > 1 is a little confusing -- it doesn't really jibe with my intuition of what a learning rate is.
Actually, part of the reason I was looking into this was that in Figure 2 of this paper [1], they show that the optimal learning rate schedule for a simple NN reaches values as high as LR = 7.0 at certain points, but it looks like under their parameterization this probably translates to ~0.7 in the SGD implementations I'm more familiar with. (Those are still high LRs, but not 7.0!)
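For concreteness, assuming their schedule uses the autograd-style convention with momentum = 0.9 (my assumption, not something stated in the paper), the conversion is just multiplying by (1 - momentum):

# Back-of-the-envelope conversion; momentum = 0.9 and the autograd-style
# convention for the paper's schedule are my assumptions.
momentum = 0.9
paper_lr = 7.0
equivalent_cs231_lr = paper_lr * (1.0 - momentum)
print(equivalent_cs231_lr)  # ~0.7 -- still high, but not 7.0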
Thanks again!
Thanks for the catch on my flipped calls :)
Good point about learning rates. It'd be good to follow some standard so that people can transfer learning rate values without too much confusion. It looks like TensorFlow does the same thing as PyTorch.
@dougalm or @duvenaud, should we switch the optimizer(s) to use a different hyperparameter convention to match other libraries?
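If we did switch, a minimal sketch of the alternative convention might look like this -- illustrative only, not the current autograd optimizer API (the names, defaults, and the grad(x, i) signature are just for the example):

import numpy as np

def sgd_momentum(grad, x, num_iters=200, learning_rate=0.1, momentum=0.9):
    # Sketch of the CS231n-style convention: learning_rate scales the
    # gradient inside the velocity update, with no (1 - momentum) damping.
    velocity = np.zeros_like(x)
    for i in range(num_iters):
        g = grad(x, i)
        velocity = momentum * velocity - learning_rate * g
        x = x + velocity
    return x

For a constant learning rate I believe this produces the same iterates as the PyTorch version, so published learning rates would transfer directly.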
Hi --
I'm wondering about the implementation of SGD w/ momentum in autograd.optimizers.sgd, which does:

velocity = momentum * velocity - (1.0 - momentum) * g
x = x + learning_rate * velocity

Other places I've looked have a different version of momentum -- CS231 from Stanford and the NAG paper [1] have:

velocity = momentum * velocity - learning_rate * g
x = x + velocity

EDIT: pytorch implements something closer to autograd, but definitely still different:
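Roughly, as I understand the PyTorch default (dampening = 0, no weight decay or Nesterov) -- treat this as my reading, not PyTorch's literal source:

velocity = momentum * velocity + g
x = x - learning_rate * velocity
# Like autograd, learning_rate multiplies the velocity at the parameter step;
# unlike autograd, g is not scaled by (1.0 - momentum).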
Does anyone have any thoughts on why these might be different? AFAICT they are meaningfully different, so I'm not sure which one is correct.
Thanks,
Ben
[1] http://www.cs.toronto.edu/~hinton/absps/momentum.pdf