JuliaNLSolvers / Optim.jl

Optimization functions for Julia

Line search, BFGS and the Wolfe conditions #26

Closed: bsxfan closed this issue 11 years ago

bsxfan commented 11 years ago

Hi all

I have a few years of experience implementing (I like to write my own code) and using various large-scale optimization algorithms in MATLAB. I have worked with non-linear conjugate gradient and various quasi-Newton and truncated-Newton solvers, in both line-search and trust-region variants.

In my experience, the line-search component of an optimizer is very important, but one very often sees it neglected. The line search balances two conflicting requirements: it must find a step that makes sufficient progress along the search direction, yet it should not spend an excessive number of function evaluations doing so.

In the BFGS/LBFGS algorithms, the line search is of particular importance: if it fails to meet the Wolfe conditions, the inverse Hessian approximation can end up not being positive definite, which can send the optimizer uphill instead of downhill. A good line search should guarantee that these conditions are satisfied.
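For reference, the strong Wolfe conditions can be checked in a few lines of Julia. This is a minimal sketch with illustrative names (`f`, `grad`, `c1`, `c2` are not part of Optim.jl's actual API):

```julia
using LinearAlgebra

# Minimal sketch: does step length α along direction d satisfy the
# strong Wolfe conditions? `f` and `grad` are callables returning the
# objective value and gradient; c1, c2 are the usual Wolfe constants.
function satisfies_strong_wolfe(f, grad, x, d, α; c1 = 1e-4, c2 = 0.9)
    ϕ0  = f(x)
    dϕ0 = dot(grad(x), d)          # must be < 0 for a descent direction
    xα  = x + α * d
    ϕα  = f(xα)
    dϕα = dot(grad(xα), d)
    sufficient_decrease = ϕα <= ϕ0 + c1 * α * dϕ0    # Armijo condition
    curvature           = abs(dϕα) <= c2 * abs(dϕ0)  # curvature condition
    return sufficient_decrease && curvature
end
```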

If anyone feels like translating from MATLAB, a version of my favourite line search is available here: http://www.cs.umd.edu/users/oleary/software/ You will have to translate both cvsrch.m and cstep.m.

I have been using this implementation in my own optimizers for a while now, and I find the extra complication well worth it. This line search also helps the optimizer deal more efficiently with non-convex regions of the objective function.

As a further recommendation for this line-search algorithm, it is used in Poblano: https://software.sandia.gov/trac/poblano/

If someone else does not do this translation, I will probably end up doing it myself, but I can't promise when ...

johnmyleswhite commented 11 years ago

Thanks for the reference to the MATLAB implementation of the Moré and Thuente line search. I had found a few implementations of it recently, but none in MATLAB before.

I'll work on translating that code over the weekend, assuming the licensing works out.

timholy commented 11 years ago

Is it really clear that these are better than the Hager & Zhang line search that's already present in cgdescent? It has many attractive features, among them the approximate Wolfe conditions, which allow convergence to much higher precision than the exact Wolfe conditions permit in floating-point arithmetic.

timholy commented 11 years ago

Also, a rigorous proof of convergence. I forgot that one.

johnmyleswhite commented 11 years ago

It's not clear at all, which is why I'd like to have both available.

johnmyleswhite commented 11 years ago

Are you saying that the Moré and Thuente proof isn't rigorous?

timholy commented 11 years ago

No (I don't know that one), just that there is a rigorous proof of convergence for the H&Z algorithm. When one uses the approximate Wolfe conditions, until relatively recently the tradition seemed to be: "it's been proved for the exact Wolfe conditions, and it's probably true for the approximate ones as well." But H&Z proved it for the approximate conditions.
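For reference, the approximate Wolfe test that H&Z proved convergence for is a two-sided bound on the directional derivative. A sketch with illustrative names, using the δ and σ parameters from the 2005 Hager–Zhang paper:

```julia
# Sketch of the Hager–Zhang approximate Wolfe test. ϕ′0 and ϕ′α are the
# directional derivatives at step lengths 0 and α; defaults for δ and σ
# follow the 2005 paper (0 < δ < 1/2, δ <= σ < 1).
approx_wolfe(ϕ′0, ϕ′α; δ = 0.1, σ = 0.9) = (2δ - 1) * ϕ′0 >= ϕ′α >= σ * ϕ′0
```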

johnmyleswhite commented 11 years ago

Having looked into this a bit, it looks like the Moré and Thuente algorithm has, as @timholy suspected, guaranteed convergence only for the strong Wolfe conditions, not the approximate conditions. To provide us with the most options, I've nevertheless translated the BSD version of the Moré and Thuente code that @bsxfan linked to and have already included it in the latest commit as a possible option. Tomorrow I'll try to adapt the code for the H&Z algorithm for use with all of our algorithms.

timholy commented 11 years ago

Very cool! I suspect M&T will nevertheless work for many (if not all) problems, and it would indeed be good to have a head-to-head comparison against H&Z.

Of course, making that comparison comprehensive enough will be a challenge. Someday we definitely need access to CUTEr. Hmm, that would have been a good GSoC project for someone...

johnmyleswhite commented 11 years ago

Hopefully. Right now M&T isn't doing much better than our previous line search algorithms; I need to check for bugs, since I'm rather disappointed by its performance.

Agreed, CUTEr would have been a good GSoC project. I looked into writing something to access CUTEr, but realized that it was much more work than I had originally suspected.

timholy commented 11 years ago

Re CUTEr: yes, I think it would be a substantial commitment, and like you I don't have time for this now either. If someone does tackle it, I'm not sure whether it would be easier to parse the SIF format, or whether it would be easier to wrap the Fortran library and have the objective functions evaluated in Fortran.

johnmyleswhite commented 11 years ago

I've adapted Tim's code for the H&Z line search to work with the new API. The H&Z line search does seem to be the best algorithm we have implemented, although it sometimes fails for the current BFGS code, which occasionally produces search directions that are not descent directions (they fail to make an acute angle with the negative gradient). The M&T line search also rejects these search directions, so I think I'll make the H&Z line search the default in the near future, while continuing to allow the user to force the use of their preferred line search algorithm.
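The rejection both line searches apply amounts to a sign test on the directional derivative. A one-line illustration (the name here is mine, not the package's):

```julia
using LinearAlgebra

# d is a descent direction at x only if its inner product with the
# gradient is negative, i.e. it makes an acute angle with -∇f.
is_descent_direction(gradient, d) = dot(gradient, d) < 0
```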

It'll take me a few more days to structure things in a way that doesn't disturb the existing cgdescent code, while exposing the same functionality under the new API.

timholy commented 11 years ago

This sounds very impressive, John.

The non-acute angle with BFGS suggests that the inverse Hessian is not always positive definite. To me this sounds like a potential bug. See http://math.stackexchange.com/questions/209237/why-does-the-standard-bfgs-update-rule-preserve-positive-definiteness

johnmyleswhite commented 11 years ago

Yes, I agree: there is probably a subtle bug somewhere in the BFGS code.

bsxfan commented 11 years ago

Hi all

It is great to see all the work you have done on the line-search algorithms! (The only omission I see in the M&T Julia implementation is that you left out the 'last card of subroutine' comments; I feel those should be there to remind people how old this algorithm is. :-)

I agree that when BFGS/LBFGS proposes an uphill direction, this is due to a Hessian approximation that is not positive definite. If you like, you can compare your LBFGS with my implementation. The following links point to four MATLAB m-files on Dropbox:

http://db.tt/t7xAj7zu http://db.tt/LmztmNM7 http://db.tt/RkD3tTbW http://db.tt/mcHWjfez

The m-files are BFGS, LBFGS and two slightly modified versions of Diane O'Leary's M&T line search. I implemented BFGS and LBFGS myself using the algorithms in Nocedal's book. I have been using these algorithms for a while now on various optimization problems and they seem to behave well.

johnmyleswhite commented 11 years ago

Good call on checking whether the inverse Hessian approximation was positive definite. In several test problems, the BFGS rule was indeed producing a matrix that was not positive definite. Resetting the matrix to the identity in these cases fixes the failing tests.
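A minimal sketch of that guard, using the textbook inverse-Hessian update from Nocedal and Wright; the function name and reset policy here are illustrative, not the actual Optim.jl code:

```julia
using LinearAlgebra

# Sketch: apply the BFGS inverse-Hessian update B ← VᵀBV + ρssᵀ with
# V = I - ρysᵀ and ρ = 1/(sᵀy), but only when sᵀy > 0; otherwise the
# update cannot preserve positive definiteness, so reset B to the
# identity (making the next step a plain gradient step).
function update_inverse_hessian!(B, s, y)
    sy = dot(s, y)
    if sy <= 0                # curvature condition violated, e.g. by roundoff
        copyto!(B, I)         # reset to the identity
        return B
    end
    ρ = 1 / sy
    V = I - ρ * (y * s')
    B .= V' * B * V .+ ρ * (s * s')   # dense form, for clarity
    return B
end
```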

timholy commented 11 years ago

Interesting. Unless there's a bug it should only be possible via roundoff error. Are you checking the analog of that "δᵀη > 0" on that page I linked to above? I think that's equivalent to checking the sign of your rho variable in bfgs.jl.

Also, to me it looks like your update rule is needlessly expensive computationally: it involves two multiplications of dense NxN matrices, an O(N^3) operation. An alternative formulation of the update (also on that page I linked to) is O(N^2), because it involves only rank-1 updates.
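Expanding the product form of the update yields the O(N^2) version. A sketch assuming H is the symmetric inverse-Hessian approximation (names are illustrative):

```julia
using LinearAlgebra

# O(N^2) BFGS inverse-Hessian update: expanding
#   H⁺ = (I - ρsyᵀ) H (I - ρysᵀ) + ρssᵀ
# with u = Hy (and H symmetric, so yᵀH = uᵀ) gives
#   H⁺ = H - ρ(suᵀ + usᵀ) + (ρ²·yᵀu + ρ)·ssᵀ,
# i.e. one matrix-vector product plus rank-1 updates.
function bfgs_update_O2!(H, s, y)
    ρ = 1 / dot(s, y)
    u = H * y                       # O(N^2) matrix-vector product
    c = ρ^2 * dot(y, u) + ρ         # coefficient of s*sᵀ
    for j in eachindex(s), i in eachindex(s)
        H[i, j] += c * s[i] * s[j] - ρ * (s[i] * u[j] + u[i] * s[j])
    end
    return H
end
```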

timholy commented 11 years ago

See also: http://en.wikipedia.org/wiki/BFGS_method, which gives an update rule for the inverse Hessian; that is what your B seems to be. (Notation in this field is dreadfully confusing!)

johnmyleswhite commented 11 years ago

I think the problem lies somewhere between round-off errors and an ill-conditioned inverse Hessian.

I'll try using a Sherman–Morrison update this week. My current implementation is just a textbook copy of BFGS as described in Nocedal and Wright.

Yes, B is the inverse Hessian in my implementation, although in the BFGS literature B usually denotes the Hessian itself (i.e. not its inverse). I'll probably choose another name at some point for greater clarity. I've started going through my code and standardizing names across functions to (a) make the codebase clearer and (b) make it easier to see how the algorithms relate to one another.

johnmyleswhite commented 11 years ago

At this point, I'm pretty happy with the line search code we now have. I've made Tim's HZ line search the default, since it seems to be the best of the batch on average. If we have specification issues to resolve, let's handle them separately.