vroulet opened this issue 5 months ago
Well, not by me/us. I think @cortner implemented it and wrote those words. I think it is supposed to mean that MoreThuente and HagerZhang are mostly meant to achieve the same thing as StrongWolfe, but using more efficient algorithms/step size proposals.
MoreThuente and HagerZhang both implement variants of the Wolfe conditions (I think strong, but I don't remember). Both are more sophisticated and more performant, and in particular HZ is numerically more stable in the regime of small changes to the objective, where rounding starts kicking in.
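For concreteness, the strong Wolfe test these line searches are (roughly) trying to satisfy looks like this. This is just a minimal sketch of the conditions themselves, not the LineSearches.jl code, and the function name is made up:

```julia
# Sketch of the strong Wolfe test for a step length α along a descent direction d,
# in terms of φ(α) = f(x + α d) and φ'(α) = ∇f(x + α d)'d.
function strong_wolfe_satisfied(φ0, dφ0, φα, dφα, α; c1 = 1e-4, c2 = 0.9)
    sufficient_decrease = φα <= φ0 + c1 * α * dφ0   # Armijo / sufficient decrease
    curvature           = abs(dφα) <= c2 * abs(dφ0) # strong curvature condition
    return sufficient_decrease && curvature
end
```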
Arguably "StrongWolfe" is badly named. Something like "NaiveBracketing" might be better.
For many but not all (or even most?) real-world problems the Wolfe conditions are not needed and slow down the optimization significantly, hence BackTracking was mentioned here as well. Backtracking with the right interpolation will satisfy Wolfe "often".
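To make that concrete, this is roughly how you would swap in BackTracking when calling Optim.jl (a sketch assuming the current keyword API; Rosenbrock is just a stand-in objective):

```julia
using Optim, LineSearches

# Placeholder objective (Rosenbrock); substitute your own f.
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Default line search vs. plain backtracking.
res_hz = optimize(f, [-1.2, 1.0], LBFGS(linesearch = LineSearches.HagerZhang()))
res_bt = optimize(f, [-1.2, 1.0], LBFGS(linesearch = LineSearches.BackTracking()))
```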
There is no single best algorithm. One needs to find out what the best algorithm for specific problem classes is.
But I see you worked with Francis, so none of this should be a surprise for you?
I've worked with Francis more on convex problems, and at that time if I needed to use lbfgs I would have used Mark Schmidt's implementation.
I was asking because we wanted to have an implementation of LBFGS in JAX. I followed Nocedal and Wright, who present this "strongwolfe" implementation (which they call a zoom linesearch). But the comments here cast doubt on that implementation. And, as you said, in the end a backtracking linesearch is probably a reasonable default linesearch.
Out of curiosity, what do you mean by backtracking "with the right interpolation"?
And thank you both @cortner @pkofod for answering this question!
> I followed Nocedal and Wright

I recommend following Hager-Zhang instead for the most robust algorithm.

> And, as you said, in the end a backtracking linesearch is probably a reasonable default linesearch.
possibly with a restart heuristic...
> Out of curiosity, what do you mean by backtracking "with the right interpolation"?
Again, this can be problem dependent. For the kind of problems I've been working on, conditioning is an issue, but good enough preconditioners are normally easy to apply. Cost of the objective and gradient is the killer. In that scenario I have found that quadratic interpolation from the current objective value and directional derivative, plus the trial objective value (three pieces of information, hence a quadratic), gives a great one-dimensional model to minimize in the line search. What turns out to be even more important is to also use a good heuristic to estimate the best step length for the next iteration, in order to avoid the line search altogether whenever you can. All this is implemented in LineSearches.jl and Optim.jl.
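In case it helps, a rough sketch of that quadratic-interpolation step (the names and safeguard constants are mine; the actual BackTracking code in LineSearches.jl does more bookkeeping):

```julia
# One backtracking step via quadratic interpolation.
# φ0 = f(x), dφ0 = ∇f(x)'d (< 0), and φα = f(x + α d) is the rejected trial value.
function quadratic_backtrack_step(φ0, dφ0, α, φα; lo = 0.1, hi = 0.5)
    # Fit q(t) = φ0 + dφ0*t + c*t² through the three pieces of information
    # and take its minimizer as the next trial step.
    c = (φα - φ0 - dφ0 * α) / α^2
    αnew = -dφ0 / (2c)
    # Safeguard: keep the new step inside [lo*α, hi*α] so progress is guaranteed.
    return clamp(αnew, lo * α, hi * α)
end
```

If I remember correctly, the step-length guess for the next iteration is what the `alphaguess` options in Optim.jl (e.g. `LineSearches.InitialQuadratic()` or `LineSearches.InitialHagerZhang()`) control.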
Hello,
Thank you for all your work! I was wondering why, in the docstring of the StrongWolfe linesearch, the authors say the following.
Has there been a benchmark that supports this claim in terms of function evaluations and overall performance when combined with e.g. LBFGS?