dave-fernandes / SaddleFreeOptimizer

A second order optimizer for TensorFlow that uses the Saddle-Free method.
Apache License 2.0

Why hasn't the saddle-free Newton method gained popularity? #1

Closed: JarekDuda closed this issue 5 years ago

JarekDuda commented 5 years ago

Hi, I am convinced that we should move to 2nd order methods, ones which actively repel saddles like SFN, but it seems they are not used (?) nearly 5 years after the paper. Do you maybe know what the problem is? How does your implementation compare with other popular methods? Best, Jarek Duda. P.S. I just started a discussion about it: https://redd.it/bbw51k

dave-fernandes commented 5 years ago

Hi Jarek, If you look at the XOR problem, the SF optimizer converges much, much faster than 1st order methods. But on realistic problems, it does not. I think the 2nd order approximation is just not good enough to navigate the valleys in the loss function. Higher order terms are more important. And if you use Relu activations, then the Hessian is not at all predictive of places where the gradient will suddenly change. See this paper where they use finite differences to measure curvature rather than the Hessian: https://arxiv.org/abs/1811.09716

Saddle points also do not seem to be that big of a problem, though I don't know of a paper that shows this for certain. Well chosen initial conditions and first order methods with momentum still seem to be the state of the art.

There is still work being done on 2nd order methods. See K-FAC: https://arxiv.org/abs/1503.05671 https://github.com/tensorflow/kfac (I believe this optimizer only works for fully-connected networks so far.)

JarekDuda commented 5 years ago

Hi Dave, thanks, I will have to read the papers. ReLU indeed seems problematic for 2nd order methods, but maybe averaging over minibatches smooths it into something that can be locally modeled by a 2nd-degree polynomial (?) The original SFN paper claims a great improvement: that other methods got stuck in suboptimal solutions, probably on plateaus. Here is another interesting paper, whose Fig. 6 shows that rare negative-curvature directions lead to relatively huge improvements: https://arxiv.org/abs/1902.02366

Maybe it is just a matter of improving the SFN-like approach? For example ( https://arxiv.org/pdf/1901.11457 ):

  • Direct estimation of the Hessian from noisy data seems problematic. We are in fact interested in the linear behavior of the 1st derivative, which can be optimally estimated with a least-squares linear regression of gradients, with weakening weights for old gradients.
  • We need to restrict to a subspace, but Krylov doesn't seem the optimal way; it is only convenient for Lanczos. We should instead extract statistics of recent gradients, e.g. using a subspace from their PCA.
  • The method should be more "online", trying to exploit the local situation.

dave-fernandes commented 5 years ago

Hi Jarek,

The original SFN paper claims a great improvement: that other methods got stuck in suboptimal solutions, probably on plateaus.

Well, they claim an improvement in training loss over HF, but the validation loss is already way higher than the training loss. So from a practical point of view, there is no difference.

Here is another interesting paper, whose Fig. 6 shows that rare negative-curvature directions lead to relatively huge improvements: https://arxiv.org/abs/1902.02366

Thanks! Interesting paper. I'll have to follow up on this and see if changing the handling of negative eigenvalues makes a difference in the SF method.

Maybe it is just a matter of improving the SFN-like approach? For example ( https://arxiv.org/pdf/1901.11457 ):

  • Direct estimation of the Hessian from noisy data seems problematic. We are in fact interested in the linear behavior of the 1st derivative, which can be optimally estimated with a least-squares linear regression of gradients, with weakening weights for old gradients.
  • We need to restrict to a subspace, but Krylov doesn't seem the optimal way; it is only convenient for Lanczos. We should instead extract statistics of recent gradients, e.g. using a subspace from their PCA.
  • The method should be more "online", trying to exploit the local situation.

Interesting ideas. Can you show theoretically how the expectation value for your curvature estimate compares to other quantities used in the literature? Or experimentally compare them?

JarekDuda commented 5 years ago

Hi Dave, handling saddles allows reaching a local minimum, but indeed the big question is generalization; it often leads to overfitting. But generally, saddle repulsion is only one addition to 2nd order methods, which also allow e.g. a smarter choice of step size and optimizing in multiple directions simultaneously.

Regarding Hessian estimation, I have seen basically 3 approaches in the literature: a Gauss-Newton-like approximation (pretending the function is locally convex), direct backpropagation of 2nd derivatives, and differences of gradients (Hessian-free Newton). Am I missing something?

The last can be seen as a linear regression of the gradient from two points, which seems problematic for noisy gradients. Least-squares linear regression allows analogously extracting this statistical trend from many noisy gradients, and it can be made online to represent the local situation by reducing the weights of old gradients with an exponential moving average.

Direct backpropagation seems to need a large minibatch to get decent accuracy for H^-1, especially for lambda ~ 0. To split it into small steps that exploit the local situation, I don't see a better way than least-squares linear regression of gradients. The question is how beneficial the dependence on the local situation is: is it worth going to online methods?

JarekDuda commented 5 years ago

Regarding the choice of a statistically relevant subspace for modeling the Hessian, it turns out there are inexpensive online PCA methods: https://arxiv.org/abs/1307.0032

Mine is simpler/cheaper: just add to all vectors of the current basis a part of the gradient orthogonal to the current subspace, and maintain orthonormalization (just improved in the updated arXiv version).
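A minimal NumPy sketch of such an update (hypothetical, not code from the paper or from this repository), assuming the basis is stored as a k x n matrix Q with orthonormal rows and alpha is a made-up step parameter:

```python
import numpy as np

def update_subspace(Q, grad, alpha=0.1):
    # Q: k x n matrix with orthonormal rows spanning the current subspace.
    # Add a fraction of the gradient component outside the subspace to
    # every basis vector, then re-orthonormalize.
    g_perp = grad - Q.T @ (Q @ grad)          # part of grad orthogonal to span(Q)
    norm = np.linalg.norm(g_perp)
    if norm > 1e-12:
        Q = Q + alpha * g_perp[None, :] / norm
    Q_new, _ = np.linalg.qr(Q.T)              # re-orthonormalize (Gram-Schmidt equivalent)
    return Q_new.T
```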

But if that is not good enough, one can use such an online PCA method.

To be less dependent on this choice of subspace, we can use a hybrid method: 2nd order in the modeled subspace ... and we can simultaneously do 1st order practically for free, e.g. just gradient descent using the part of the gradient orthogonal to the modeled subspace.
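Roughly, such a hybrid update could look like the following sketch (the names and the saddle-free |eigenvalue| rescaling from the SFN paper are my assumptions, not a reference implementation); Q spans the modeled subspace and H_sub is the k x k Hessian model inside it:

```python
import numpy as np

def hybrid_step(theta, grad, Q, H_sub, lr=0.01):
    g_sub = Q @ grad                     # gradient coordinates inside the subspace
    g_perp = grad - Q.T @ g_sub          # leftover component, handled by plain SGD
    # Saddle-free-style 2nd order step: divide by |eigenvalues| so that
    # negative-curvature directions are repelled rather than attracted.
    w, V = np.linalg.eigh(H_sub)
    step_sub = V @ ((V.T @ g_sub) / np.maximum(np.abs(w), 1e-6))
    return theta - Q.T @ step_sub - lr * g_perp
```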

With least-squares linear regression we could even do a hybrid with a 3rd order model, e.g. just for the single most interesting direction.

dave-fernandes commented 5 years ago

@JarekDuda You will have to create a proof of concept and show that it works and scales well to convince people.

JarekDuda commented 5 years ago

Estimating a linear trend from just the difference of two noisy values is definitely less accurate than using least-squares linear regression instead; I don't think that needs convincing, and a literature search strongly suggests it wasn't used for estimating the Hessian in SGD methods before. Trying such an upgrade in current 2nd order methods is itself worth investigating. Also, you can add gradient descent to your implementation: the 2nd order method only operates in the (Krylov) subspace, and you can subtract the gradient in the remaining directions practically for free. The full method is quite complex, has many hyperparameters, and I don't have experience; I will probably try digging there, but definitely not before the summer break.

dave-fernandes commented 5 years ago

Estimating a linear trend from just the difference of two noisy values is definitely less accurate than using least-squares linear regression instead; I don't think that needs convincing, and a literature search strongly suggests it wasn't used for estimating the Hessian in SGD methods before. Trying such an upgrade in current 2nd order methods is itself worth investigating.

The current method does not use differences at all. It uses two reverse-mode automatic differentiation passes to compute the Hessian-vector product at a point.
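For reference, the standard double-backprop pattern looks roughly like this in TF2's GradientTape API (a generic sketch of the technique, not the code of this repository):

```python
import tensorflow as tf

def hessian_vector_product(loss_fn, params, v):
    # params: a single tf.Variable; v: a constant tensor of the same shape.
    # Hv = d/d(params) of <grad(loss), v>, i.e. two reverse-mode passes.
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = loss_fn(params)
        grad = inner.gradient(loss, params)
        grad_dot_v = tf.reduce_sum(grad * v)
    return outer.gradient(grad_dot_v, params)
```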

Also, you can add gradient descent to your implementation: the 2nd order method only operates in the (Krylov) subspace, and you can subtract the gradient in the remaining directions practically for free.

This is already part of the method. The gradient direction is the first basis vector in the Krylov subspace.
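Schematically (my own NumPy sketch, not the repository's implementation), the Krylov basis starts from the gradient and is extended by repeated Hessian-vector products with Gram-Schmidt orthogonalization:

```python
import numpy as np

def krylov_basis(hvp, grad, k):
    # Orthonormal basis for span{g, Hg, H^2 g, ...}; hvp(v) returns Hv.
    basis = [grad / np.linalg.norm(grad)]      # gradient is the first basis vector
    for _ in range(k - 1):
        w = hvp(basis[-1])
        for b in basis:                        # orthogonalize against earlier vectors
            w = w - np.dot(w, b) * b
        norm = np.linalg.norm(w)
        if norm < 1e-10:                       # invariant subspace reached; stop early
            break
        basis.append(w / norm)
    return np.stack(basis)                     # shape: (k, n)
```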

The full method is quite complex, has many hyperparameters, and I don't have experience; I will probably try digging there, but definitely not before the summer break.

JarekDuda commented 5 years ago

Sure, SFN uses backpropagation of second derivatives, but it needs stopping and going through a large part of the dataset for decent accuracy, and it still requires a line search, which indicates it is very inaccurate.

Differences of gradients are used e.g. in Hessian-free Newton; I am only saying that a much more accurate way to extract a linear trend is an MSE fit from multiple points rather than from just two, which can be seen as a special case.

Additionally, we can do it cheaply online: update four (exponential moving) averages, of g, x, gx, and x^2, weakening the weights of old gradients.
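For a single direction d, with x = <theta, d> and g = <gradient, d>, the exponentially weighted regression slope follows from those four averages roughly as in this sketch of the idea (the decay rate beta is an assumed hyperparameter):

```python
class OnlineCurvature:
    # Exponentially weighted least-squares slope of g versus x along one direction;
    # the slope estimates the 2nd derivative (curvature) in that direction.
    def __init__(self, beta=0.99):
        self.beta = beta
        self.x = self.g = self.gx = self.xx = 0.0

    def update(self, x, g):
        b = self.beta
        self.x  = b * self.x  + (1 - b) * x
        self.g  = b * self.g  + (1 - b) * g
        self.gx = b * self.gx + (1 - b) * g * x
        self.xx = b * self.xx + (1 - b) * x * x

    def slope(self):
        var = self.xx - self.x ** 2            # weighted variance of x
        return 0.0 if var < 1e-12 else (self.gx - self.g * self.x) / var
```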

The question is whether online 2nd order methods can be more successful, carefully exploiting local situations instead of making big steps. Do you know of any online 2nd order methods?

dave-fernandes commented 5 years ago

Sure, SFN uses backpropagation of second derivatives, but it needs stopping and going through a large part of the dataset for decent accuracy, and it still requires a line search, which indicates it is very inaccurate.

The HF method uses a line search (this is required to satisfy the conjugate-gradient method's way of constructing orthogonal basis vectors). The SF method doesn't do a line search. But, yes, they do require substantial minibatch sizes.

Differences of gradients are used e.g. in Hessian-free Newton; I am only saying that a much more accurate way to extract a linear trend is an MSE fit from multiple points rather than from just two, which can be seen as a special case.

The HF method uses autodiff just like the SF method. It doesn't use gradient differences. See Martens (2010):

While the product Hd can be computed using finite-differences, this approach is subject to numerical problems and also requires the computationally expensive evaluation of non-linear functions. Pearlmutter (1994) showed that there is an efficient procedure for computing the product Hd exactly for neural networks and several other models such as RNNs and Boltzmann machines. This algorithm is like backprop as it involves a forward and backward pass, is “local”, and has a similar computational cost. Moreover, for standard neural nets it can also be performed without the need to evaluate non-linear functions.

The question is whether online 2nd order methods can be more successful, carefully exploiting local situations instead of making big steps. Do you know of any online 2nd order methods?

I don't know of any. Please let us know if it works!

JarekDuda commented 5 years ago

Oh I see, the HF paper mentions finite differences on the 3rd page, but I hadn't noticed that it ultimately uses backpropagation instead on the 4th page. Thanks for clarifying.

Regarding SF, it has this "argmin_lambda f" at the bottom of the pseudocode. It is not literally a line search, but it seems to have a similar meaning and cost (?) A careful online method shouldn't need anything like that.

I might decide to work on this when I have more time during the summer break, if I exhaust more interesting topics, but I am rather an information theory person ( http://th.if.uj.edu.pl/~dudaj/ ) without experience implementing high-cost ML, working on a laptop with integrated graphics ...