
Why Momentum Really Works
https://distill.pub/2017/momentum/

Peer Review Report 1 -- Matt Hoffman #29

Closed. colah closed this issue 7 years ago.

colah commented 7 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer, Matt Hoffman, for taking the time to write such a thorough review.


A note on the relevance of convergence rates.

This article does not include the word “stochastic” anywhere. This is a problem, because

  1. In machine learning (and probably beyond), momentum is almost always applied to noisy gradients. (Quasi-Newton methods like L-BFGS are usually more effective than SGD with momentum in the noise-free setting.)

  2. The theoretical results about momentum that the article presents do not apply in the presence of gradient noise.

That is, to the extent that momentum is useful in modern machine learning, it’s not because it yields better convergence rates. It’s because it makes it possible to speed up SGD’s lengthy initial search process. (This is noted in [1], the Sutskever et al. article.)

That’s not to say that the intuitions developed in this article aren’t relevant to the stochastic setting. They are, especially early in optimization when the magnitude of the gradient noise may be small relative to the magnitude of the oscillations due to using a large step size.

But it’s important to be clear that the convergence rate results are at best suggestive—they probably don’t apply in most practical applications.
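For concreteness, here is a minimal sketch of the setting under discussion: heavy-ball momentum applied to noisy gradient estimates of a toy quadratic. The objective, noise model, and constants are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Toy quadratic f(w) = 0.5 w^T A w with condition number 100, plus
# additive Gaussian noise on each gradient evaluation (both assumptions).
rng = np.random.default_rng(0)
A = np.diag([1.0, 100.0])      # eigenvalues: lambda_min = 1, lambda_max = 100
w = np.array([1.0, 1.0])       # initial iterate
z = np.zeros_like(w)           # momentum buffer
alpha, beta, sigma = 0.01, 0.9, 0.1

for k in range(500):
    grad = A @ w + sigma * rng.standard_normal(2)  # noisy gradient
    z = beta * z + grad                            # accumulate momentum
    w = w - alpha * z                              # take the step
```

For this toy setup, the distance to the optimum initially shrinks at roughly the deterministic rate and then plateaus at a noise floor, which is the behavior at issue in this exchange.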

Suggestions/observations on figures:

Teaser figure:

“Decomposing the error”:

“Example - Polynomial Regression”:

Comments on the text:

Content:

Typos etc.:

gabgoh commented 7 years ago

Thank you very much Matt for the useful comments!

I will formulate a more detailed response soon. The stylistic tips are appreciated, and I will consider them carefully.

A quick response to the main point on stochastic approximation. As you point out, these results are still useful in early optimization. Where I disagree is with the claim that these rates do not hold at all in the stochastic setting: they still hold in the early parts of optimization, just not asymptotically. There is a detailed analysis of this in Section 4 of Flammarion and Bach, https://arxiv.org/abs/1504.01577. The gist is that these rates still hold in expectation, up to the point where optimization stalls; see Eq. 13 for the non-strongly-convex case.
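Schematically, bounds of this kind have a bias-variance structure. The following is a paraphrase of the general shape of such results, not the precise statement of Eq. 13:

```latex
% General shape of expected-error bounds for momentum under gradient noise
% (structure only; see Flammarion and Bach for the exact statement):
\[
  \mathbb{E}\bigl[f(x_k) - f(x_\ast)\bigr]
  \;\le\;
  \underbrace{B(x_0, k)}_{\text{decays at the deterministic rate}}
  \;+\;
  \underbrace{V(\sigma^2, k)}_{\text{noise term that eventually dominates}} .
\]
```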

It might be worthwhile talking about this in the article. What are your thoughts Chris/Shan?

colah commented 7 years ago

It might be worthwhile talking about this in the article. What are your thoughts Chris/Shan?

It seems like you could address this issue with a paragraph or two. For example, you could acknowledge that you're exploring a limited case -- quadratics without gradient noise -- where it's possible to get theoretical traction, and then explain that the results should be taken with a grain of salt, but that there are reasons to think much of the intuition transfers over.

gabgoh commented 7 years ago

This article does not include the word “stochastic” anywhere. This is a problem, because In machine learning (and probably beyond), momentum is almost always applied to noisy gradients ... But it’s important to be clear that the convergence rate results are at best suggestive—they probably don’t apply in most practical applications.

I disagree with the premise of this objection. As I have noted before, non-asymptotic expected convergence rates can be derived at all points in the optimization. I have added a new section on SGD which hopefully addresses some of these concerns.

The most interesting stuff happens when beta is between 0.9 and 0.99, but it’s hard to get fine control in that region.

I agree with the sentiment, though I have not found a good solution to this problem.

It would be nice to be able to set the step size high enough that the dynamics actually explode.

This has been fixed.

If the largest eigenvalue is 100, shouldn’t the maximum stable step size be 0.02 instead of 2?

Thank you! This has been fixed.
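For reference, the bound behind the reviewer's arithmetic is a standard result for quadratics, stated here for completeness:

```latex
% Gradient descent on f(w) = \tfrac{1}{2} w^{T} A w iterates
% w_{k+1} = (I - \alpha A) w_k, which is stable iff
% |1 - \alpha \lambda_i| < 1 for every eigenvalue of A, i.e.
\[
  \alpha < \frac{2}{\lambda_{\max}},
  \qquad
  \lambda_{\max} = 100 \;\Longrightarrow\; \alpha < \frac{2}{100} = 0.02 .
\]
```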

Scrubbing up and down feels more natural to me than left to right, especially for the first.

Personally, I have found the current scrubbing system more natural.

“regions of f which aren’t scaled properly”, “In these unfortunate regions…”: I think “directions” is probably a better word than “regions” here. “Regions” implies that there are places where it’s not a problem.

I did intend to use the word "regions". Pathological curvature is not a global phenomenon but a local one. It is easy to construct functions where the condition number of the Hessian depends strongly on where the function is evaluated.
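A one-line example of such a function (an illustration, not taken from the article):

```latex
% For f(x, y) = \tfrac{1}{2} x^{2} + \tfrac{1}{4} y^{4}, the Hessian is
\[
  \nabla^{2} f(x, y) = \begin{pmatrix} 1 & 0 \\ 0 & 3y^{2} \end{pmatrix},
\]
% so its condition number depends on where it is evaluated:
% it equals 1 at y = 1/\sqrt{3}, and blows up as y \to 0 or y \to \infty.
```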

“The iterates either jump between valleys”: What’s that mean? Are there multiple valleys? (I assume this is meant to describe wild oscillatory behavior.)

For an n-dimensional function, there can indeed be up to (n - 1) valleys, one for each pathological direction.

“The change is innocent”: Innocent of what?

I do not understand this point.

“But the truth, if anything, is the other way round. It is gradient descent which is the hack.” In what sense is GD “hackier" than GD+momentum? Working less well than an alternative does not make a method a hack.

My point here is largely rhetorical. It is worth noting, however, that gradient descent is not an alternative to momentum but a special case of it, with beta = 0. The hack I refer to is the oversight in GD that constrains it to a one-term recurrence.
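For readers following along, the two-term recurrence in question, as given in the article, is:

```latex
% Momentum, as written in the article:
\[
  z^{k+1} = \beta z^{k} + \nabla f(w^{k}),
  \qquad
  w^{k+1} = w^{k} - \alpha z^{k+1}.
\]
% Setting \beta = 0 discards the memory term and recovers plain
% gradient descent, w^{k+1} = w^{k} - \alpha \nabla f(w^{k}).
```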

“this is the speedup you get from the Fast Fourier Transform”: To be pedantic, the FFT is O(n log n), not O(n).

Thank you. I have changed this to "similar to the speedup"
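For context, the quadratic speedup being compared to the FFT's is the standard improvement in convergence rate on quadratics with condition number kappa:

```latex
% Optimal rates on a quadratic with condition number \kappa:
\[
  \text{rate}_{\text{GD}} = \frac{\kappa - 1}{\kappa + 1},
  \qquad
  \text{rate}_{\text{momentum}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},
\]
% i.e. momentum effectively replaces \kappa by \sqrt{\kappa},
% a square-root speedup analogous in spirit to (but not the same as) the FFT's.
```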

“think of A as your favorite model of curvature - the Hessian, Fisher Information Matrix [5], etc”: I know what you mean here, but not every reader will.

I have decided to grant the reader the benefit of the doubt here.

“captures all the key features of pathological curvature”: To the extent that you define curvature in terms of the Hessian, this is tautological. But “all the key features” sounds like studying quadratics is sufficient, even though it can’t teach you anything about other important issues (e.g., vanishing/exploding gradients, saddle points, etc.).

I have been deliberately vague about what pathological curvature is, so I do not believe this to be tautological. It is true that there are aberrations of gradient descent not explained by quadratics, but I do not think this clause leads the reader to believe that studying quadratics is sufficient.

“the eigenvalues of A”: This is imprecise—the eigenvalues aren’t a space.

This has been fixed.

In “Choosing A Step-size”, sometimes the smallest eigenvalue is lambda_0 instead of lambda_1.

Thank you. This has been fixed.

“It is a measure of how singular a matrix is.”: “how singular” should be “how close to singular”—a matrix is either singular or it isn’t.

Fixed
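For reference, the usual definition, stated here for completeness:

```latex
% For symmetric positive definite A with eigenvalues
% \lambda_{1} \le \dots \le \lambda_{n}, the condition number is
\[
  \kappa(A) = \frac{\lambda_{n}}{\lambda_{1}},
\]
% which equals 1 for a perfectly scaled problem and grows without
% bound as A approaches a singular matrix.
```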

“The above analysis reveals an interesting insight”, “Surprisingly, in many applications”, etc.: These sentences tell the reader how to feel, which bugs me. IMO, it’s best to let the reader decide whether something is surprising, interesting, remarkable, etc. (I know many people who agree. Of course, people still do it: https://nsaunders.wordpress.com/2013/07/16/interestingly-the-sentence-adverbs-of-pubmed-central/)

I have decided to tone down the use of adverbs, though I have not eliminated them entirely. These words do help emphasize important points when used sparingly.

“starting at a simple initial point like 0 (call this a prior, if you like)”: I do not. A prior is a distribution, not an initialization. The point that shrinking towards 0 (either using Tikhonov regularization or early stopping) mimics maximum a-posteriori estimation with a normal prior is valid, but the initial point 0 is not a “prior”.

In consideration of your objection, I have changed the phrase to "by a gross abuse of language, let’s think of this as a prior"
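For reference, the valid equivalence the reviewer alludes to is the standard one, not spelled out in the article:

```latex
% Tikhonov-regularized least squares coincides with MAP estimation
% under a zero-mean normal prior on the weights:
\[
  \min_{w}\; \|Xw - y\|^{2} + \lambda \|w\|^{2}
  \;\;\equiv\;\;
  \max_{w}\; p(y \mid w)\, p(w),
  \qquad
  w \sim \mathcal{N}\!\Bigl(0, \tfrac{\sigma^{2}}{\lambda} I\Bigr),
\]
% where \sigma^{2} is the observation-noise variance.
```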

The dynamics of momentum:

Again, it’d be nice to be specific about where Q came from. When using notation for the first time in a little while, it’s always good to remind the reader what it means.

I have now made it a point to remind the reader of where the Q's come from.
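For reference, Q enters through the eigendecomposition used in the article to diagonalize the problem:

```latex
% A is diagonalized as
\[
  A = Q \,\mathrm{diag}(\lambda_{1}, \dots, \lambda_{n})\, Q^{T},
\]
% and the change of basis x^{k} = Q (w^{k} - w^{\ast}) decouples the
% iteration into n independent one-dimensional recurrences, one per eigenvalue.
```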

“Momentum allows us to crank up the step-size by a factor of 2 before diverging.”: It took me a moment to figure out how the figure demonstrated this—it might be good to point the reader to the upper right corner of the figure.

I have added more detail on this subject, and an explicit formula for the range of permissible step-sizes.
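The explicit range is presumably the standard heavy-ball stability condition, stated here for reference:

```latex
% Heavy-ball momentum on a quadratic converges iff, for every eigenvalue,
\[
  0 < \alpha \lambda_{i} < 2 + 2\beta,
  \qquad 0 \le \beta < 1,
\]
% so as \beta \to 1 the admissible step-size approaches twice the
% gradient-descent limit of 2 / \lambda_{\max}.
```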

“Plug this into the convergence rate, and you get”: I think the labels for the convergence rates are backwards.

This has been fixed.

“Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.” This is true for quadratics, but for non-convex problems it’s not always good advice. It’s hard to say why in general, but my (very hand-wavy) intuition is that using a very large step size and momentum quickly breaks near-symmetries in your random initialization, which makes SGD behave more greedily than it might otherwise.

This is an interesting observation, and Chris Olah has raised similar concerns. I believe the counterintuitive nature of these best practices is something worthy of further attention.

Typos etc.:

Thank you for the detailed feedback. I have fixed all the typos mentioned.

“the Convex Rosenbrok”: Rosenbrok => Rosenbrock. Also, is this terminology novel?

I do not believe so, though I suspect the term has mostly been passed along as a verbal tradition.