HIPS / autograd

Efficiently computes derivatives of NumPy code.

SGD w/ momentum #289

Open · bkj opened this issue 7 years ago

bkj commented 7 years ago

Hi --

I'm wondering about the implementation of SGD w/ momentum in autograd.optimizers.sgd:

velocity = momentum * velocity - (1.0 - momentum) * g
x = x + learning_rate * velocity

Other places I've looked use a different version of momentum -- Stanford's CS231 notes and the NAG paper [1] have:

velocity = momentum * velocity - learning_rate * g
x = x + velocity

EDIT: PyTorch implements something closer to autograd's version, but it's still different:

velocity = mass * velocity + g
x = x - learning_rate * velocity

Does anyone have thoughts on why these differ? AFAICT they are meaningfully different, so I'm not sure which one is correct.
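
For example, starting from velocity = 0 with momentum = 0.9 and learning_rate = 0.1, the first autograd step moves x by 0.1 * (1 - 0.9) = 0.01 times the gradient, while the CS231/NAG and PyTorch versions both move it by 0.1 times the gradient.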

Thanks Ben

[1] http://www.cs.toronto.edu/~hinton/absps/momentum.pdf

mattjj commented 7 years ago

Good question! I think this just comes down to parameterization, by which I mean we can produce the same iterates using any of these versions by setting the (hyper)parameters differently.

For example, take the autograd version:

velocity = momentum * velocity - (1.0 - momentum) * g
x = x + learning_rate * velocity

We could move the learning_rate scale into the velocity sequence, to yield

lr_times_velocity = momentum * lr_times_velocity - (learning_rate * (1.0 - momentum)) * g
x = x + lr_times_velocity

Now by choosing learning_rate and momentum appropriately we can simulate the CS231/NAG version (because the map (x, y) -> (x(1 - y), y) is invertible whenever y != 1).
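
Concretely: to reproduce the CS231/NAG iterates that use learning rate lr and momentum m, you'd call the autograd version with learning_rate = lr / (1 - m) and the same momentum (assuming m != 1 and a zero initial velocity).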

You can do something similar with the PyTorch version.
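
For instance, absorbing the learning rate into the velocity the same way shows that the PyTorch-style rule quoted above already matches the CS231/NAG one for the same learning_rate and momentum. A quick sketch (just the quoted update rules with zero initial velocity; the helper names here are only for illustration):

import numpy as np

gs = np.random.randn(10)  # a shared sequence of fake gradients

def cs231_step(x, velocity, learning_rate, momentum, g):
    # CS231/NAG-style: the learning rate scales the gradient inside the velocity update
    velocity = momentum * velocity - learning_rate * g
    return x + velocity, velocity

def pytorch_style_step(x, velocity, learning_rate, momentum, g):
    # PyTorch-style: the velocity accumulates raw gradients, the learning rate scales the step
    velocity = momentum * velocity + g
    return x - learning_rate * velocity, velocity

x1 = v1 = x2 = v2 = 0.
for g in gs:
    x1, v1 = cs231_step(x1, v1, 0.5, 0.9, g)
    x2, v2 = pytorch_style_step(x2, v2, 0.5, 0.9, g)
    assert np.isclose(x1, x2)  # identical iterates up to float rounding

The two velocity sequences differ only by a factor of -learning_rate, which is exactly the scale being moved around above.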

EDIT: fixed math bugs! That's the problem with doing math at the end of the day...

mattjj commented 7 years ago

Here's a quick check that the math above works out:

import numpy as np

# A fixed sequence of fake "gradients" so both versions see the same inputs.
gs = np.random.randn(10)

def cs231_sgd(x0, v0, learning_rate, momentum):
    # CS231/NAG-style momentum: the learning rate scales the gradient
    # inside the velocity update.
    x = x0
    velocity = v0
    for g in gs:
        velocity = momentum * velocity - learning_rate * g
        x = x + velocity
        print(x)
    print()

def autograd_sgd(x0, v0, learning_rate, momentum):
    # autograd-style momentum: the velocity is an exponential moving average
    # of gradients, and the learning rate scales the update to x.
    x = x0
    velocity = v0
    for g in gs:
        velocity = momentum * velocity - (1.0 - momentum) * g
        x = x + learning_rate * velocity
        print(x)
    print()

# These two calls should print identical iterates.
cs231_sgd(0., 0., 0.5, 0.9)
autograd_sgd(0., 0., 0.5 / (1. - 0.9), 0.9)

EDIT: fixed calls!

bkj commented 7 years ago

Ah great, thanks for the answer.

Looks like you have the calls flipped -- should be

cs231_sgd(0., 0., 0.5, 0.9)
autograd_sgd(0., 0., 0.5 / (1. - 0.9), 0.9)

to yield the same answers.

As a general point, I'd contend that a parameterization where learning rates are > 1 is a little confusing -- it doesn't really jibe with my intuition of what a learning rate is.

Actually, part of the reason I was looking into this is that Figure 2 of this paper [1] shows an optimal learning rate schedule for a simple NN reaching values as high as LR = 7.0 at certain points; under their parameterization, that probably translates to roughly 0.7 in the SGD implementations I'm more familiar with. (Still a high learning rate, but not 7.0!)
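
(For instance, assuming a momentum of 0.9, the conversion would be 7.0 * (1 - 0.9) = 0.7.)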

Thanks again!

[1] https://arxiv.org/pdf/1502.03492.pdf

mattjj commented 7 years ago

Thanks for the catch on my flipped calls :)

Good point about learning rates. It'd be good to follow some standard so that people can transfer learning rate values without too much confusion. It looks like TensorFlow does the same thing as PyTorch.

@dougalm or @duvenaud, should we switch the optimizer(s) to use a different hyperparameter convention to match other libraries?