JuliaNLSolvers / Optim.jl

Optimization functions for Julia

Manifold optimization #433

Closed · antoine-levitt closed this 6 years ago

antoine-levitt commented 7 years ago

A number of problems involve minimizing a function on a Riemannian manifold defined by geometric constraints. Examples include the eigenvalue problem (minimization of the Rayleigh quotient under orthogonality constraints), nonlinear models in quantum chemistry (e.g. DFT), and several problems in matrix approximation. There are several toolboxes available: ROPTLIB has bindings to Julia, and there's https://github.com/NickMcNutt/ManifoldOptim.jl, but it's abandoned (I emailed the author). One issue with these efforts is that they basically duplicate code from a general unconstrained optimizer.

As far as I can tell, adapting an unconstrained optimizer to Riemannian optimization is pretty easy (at least for first-order algorithms): basically, it amounts to projecting the gradient onto the tangent space immediately after computing it, and retracting iterates back onto the manifold after each step. Therefore, an implementation in Optim.jl would not be very intrusive. There would be a Manifold type with subtypes (e.g. Flat, Sphere, Stiefel...), functions like retract(M::Manifold, x) implemented for every subtype, and a manifold option in Options, defaulting to Flat.
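
For concreteness, here is a rough sketch of what that could look like for two simple manifolds; the names (`Flat`, `Sphere`, `Stiefel`, `retract`, `project_tangent`) are illustrative only, not a committed API:

```julia
using LinearAlgebra

# Illustrative sketch only -- names and signatures are hypothetical.
abstract type Manifold end

struct Flat    <: Manifold end   # ordinary unconstrained optimization
struct Sphere  <: Manifold end   # unit sphere in R^n
struct Stiefel <: Manifold end   # n-by-k matrices with orthonormal columns

# Map a trial point back onto the manifold after a step.
retract(::Flat, x)    = x
retract(::Sphere, x)  = x / norm(x)
retract(::Stiefel, x) = Matrix(qr(x).Q)           # one common (QR-based) retraction

# Project an ambient-space gradient onto the tangent space at x.
project_tangent(::Flat, g, x)    = g
project_tangent(::Sphere, g, x)  = g - dot(x, g) * x
project_tangent(::Stiefel, g, x) = g - x * ((x' * g + g' * x) / 2)
```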

This issue is here to gauge interest: if I submit a PR implementing the bare bones of this (e.g. gradient descent and CG for the sphere and Stiefel manifolds, and tests for the linear eigenvalue problem), will it get merged? The only potential issue I can see is that it means more code to develop and test for every kind of optimizer, but not all of them have to support manifolds, and support can be added incrementally as interest arises.

(cc @cortner, who might be interested in the functionality)

cortner commented 7 years ago

Definitely interested, and I have thought about it; it just never became urgent enough. I'd also like to suggest adding a curved line search.

pkofod commented 7 years ago

This issue is here to gauge interest: if I submit a PR implementing bare bones of this (e.g. gradient descent and CG for the sphere and stiefel manifold, and tests for the linear eigenvalue problem), will it get merged?

Sure, why not. By your description, I think this is very easy to implement in a way that does not affect the compiled code of our current algorithms for our current types of problems at all (empty fallbacks etc.).

Is there some sort of intro reference for this? It is not my field at all!

timholy commented 7 years ago

How does this differ from general equality-constrained optimization? g(x) = 0 defines a manifold.

ChrisRackauckas commented 7 years ago

It can differ because you can force the line search to stay within the manifold, instead of searching the full space but relaxing to the manifold via a constraint. But arguably the best way to do this is on the user's side: choosing coordinates which are themselves constrained to lie in the manifold (which would then force this on the line search). Or you can project onto the manifold with each change, adding a nonlinear solve between steps. This can be required, though, if the problem is just not well-behaved off the manifold.

timholy commented 7 years ago

It can differ because you can force the linesearch to stay within the manifold, instead of searching the full space but relaxing to the manifold via a constraint.

That's well-known to be an inefficient approach for most problems, because it forces you to take tiny steps or perform second-order correction to high precision. Of course there are cases where you can't evaluate your objective off the manifold, so there is a place for this, but it should be understood that we'll get much bigger benefit from solving the general problem. Of course, projection methods are pretty simple in some ways, and so getting something simple merged might be a fast way to get some of the benefits of #303.

ChrisRackauckas commented 7 years ago

In DiffEq the callbacks are powerful enough that these kinds of projection methods can be just implemented as callbacks, and then there's a library of standard callback methods which includes a projection method. Maybe that's a good approach for Optim as well: instead of implementing a projection directly, allow the user to inject a function after each update, and have an example where this performs a manifold projection. This same interface can then serve a lot of other functions as well.
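
As a rough illustration of that idea (the `post_update!` hook below is hypothetical; Optim exposes no such option today), a manifold projection expressed as a user-supplied callback might look like:

```julia
using LinearAlgebra

# Hypothetical user-side callback: called after every iterate update,
# it simply renormalizes the iterate back onto the unit sphere in place.
project_to_sphere!(x) = (x ./= norm(x); x)

# If Optim offered such a hook, usage could look roughly like:
# optimize(f, g!, x0, GradientDescent(),
#          Optim.Options(post_update! = project_to_sphere!))
```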

cortner commented 7 years ago

Eigenvalue problems and harmonic maps are examples where staying on the manifold tends to be advantageous, I think.

cortner commented 7 years ago

But in general I agree with @timholy. It's just that some of these special cases are incredibly important.

antoine-levitt commented 7 years ago

There's a recent report on ROPTLIB at http://www.math.fsu.edu/~whuang2/pdf/ROPTLIB_11WH_techrep.pdf with references to the literature. The basic idea is that if you want to minimize a function defined on a sphere, instead of stepping in the direction of the gradient, you project the gradient onto the tangent space to the sphere at the current point, take a step, and then map ("retract") back onto the sphere. There are subtleties in extending more complex algorithms like CG and BFGS (how do you combine tangent vectors at different points on the manifold? what Hessian should BFGS approximate?), and a whole lot of manifolds and variants. The sphere example is trivial, but the main interest in applications (at least from my perspective) is minimizing functionals on the space of matrices with orthonormal columns (the Stiefel manifold).

Compared to general constrained optimization, the benefit is that, when the manifold is simple enough that you can do the projection analytically, you can basically treat it as an unconstrained optimization problem and reuse the same algorithms. The typical use case would be conjugate gradient or BFGS with orthogonality constraints, and I would imagine it's pretty hard to beat this kind of method with a general constrained optimization algorithm (but I don't have a good benchmark).
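
To make the sphere example concrete, a bare-bones projected-gradient/retraction loop for the Rayleigh quotient might look like this (fixed step size, no line search; purely illustrative and not tuned):

```julia
using LinearAlgebra

# Minimize x'Ax over the unit sphere, i.e. find the smallest eigenpair of symmetric A.
function rayleigh_descent(A; iters = 1000, alpha = 0.1)
    x = normalize(randn(size(A, 1)))     # start on the sphere
    for _ in 1:iters
        g = 2 .* (A * x)                 # Euclidean gradient of x'Ax
        g .-= dot(x, g) .* x             # project onto the tangent space at x
        x = normalize(x .- alpha .* g)   # take a step, then retract back to the sphere
    end
    return dot(x, A * x), x
end

A = Symmetric(randn(5, 5))
lambda_min, v = rayleigh_descent(A)      # compare with eigmin(A)
```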

timholy commented 7 years ago

Yes, if you can do analytic projection then this is a good strategy. I'd say go for it; @ChrisRackauckas is right about thinking about this primarily in terms of user-supplied functions for performing the projection.

pkofod commented 7 years ago

Ah, okay! Well, I still think this will be a change that doesn't hurt anything, and if someone stands to gain by explicitly handling the "manifold constraint" using an analytical projection, then let's go for it.

antoine-levitt commented 7 years ago

I had a look at implementing this and it is slightly trickier than I thought it would be. In particular, the line searches need to be aware of this, because the line search is not over f(x + alpha d) but over f(R(x + alpha d)), where R is a retraction (a map back onto the manifold). I've thought of a few ways of doing this:

Thoughts?
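
A minimal sketch of the point about the line search, reusing the hypothetical `retract` from the earlier sketch: the univariate function being minimized is the composition with the retraction, not the raw linear path.

```julia
# Flat case, as the current line searches assume:
#     phi(alpha) = f(x + alpha * d)
# Manifold case: step in the tangent direction, then pull back onto the manifold.
phi(alpha, f, M, x, d) = f(retract(M, x + alpha * d))
```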

timholy commented 7 years ago

rewrite the linesearch API to be more like a 1D search

This should happen anyway, so even if painful it seems unquestionably the way to go.

cortner commented 7 years ago

the @simd loops need to go as well and be replaced by broadcasts - see also #399.

Without thinking about it further, I quite like the first approach. This would only be feasible when the projections are relatively cheap anyhow; otherwise one would switch to standard constrained optimisation, I think.

cortner commented 7 years ago

fake an objective function that does retraction

I don't like this; my current experience with the string method suggests this will be awful.

make optim and linesearches aware of manifolds

Agreed, this will be bug-prone.

anriseth commented 7 years ago

rewrite the linesearch API to be more like a 1D search

This should happen anyway, so even if painful seems unquestionably the way to go.

Are there any line search algorithms out there where this would not work? If not, then I agree that this makes sense. I would make sure evaluations of the derivative, based on the gradient of the objective, are propagated back to Optim somehow (I don't think that's very difficult).

cortner commented 7 years ago

The only case I can think of immediately that needs thinking about - and I am not even sure it is an exception - is non-monotone line search, because it depends on the history of the optimisation. Does anybody have experience with that?

antoine-levitt commented 7 years ago

I don't think non-monotone line search fits into the current API either; it's pretty much orthogonal. As far as I can tell, a line search is exactly a 1D (approximate) minimization problem, and the line searches already redefine the 1D function (e.g. linefunc! in hagerzhang), so it would make more sense to just pass them a function (and the function can include some machinery to store the gradient evaluations, akin to what's done now).
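
A sketch of what "just pass them a function" could look like under that reading (all names hypothetical, with `retract`/`project_tangent` as in the earlier sketch): a closure that returns the value and an approximate slope along the retracted path, and remembers the last gradient evaluation so it can be handed back to Optim.

```julia
using LinearAlgebra

# Hypothetical helper: build the 1D function a line search would minimize,
# caching the most recent point and projected gradient for reuse by Optim.
function make_linefunc(f, g!, M, x, d)
    g      = similar(x)
    last_x = copy(x)
    function value_and_slope(alpha)
        last_x .= retract(M, x .+ alpha .* d)   # point back on the manifold
        val = f(last_x)
        g!(g, last_x)                           # ambient gradient
        g .= project_tangent(M, g, last_x)      # Riemannian gradient
        # dot(g, d) is exact at alpha = 0 for a first-order retraction,
        # and an approximation of the slope away from it.
        return val, dot(g, d)
    end
    return value_and_slope, () -> (copy(last_x), copy(g))  # accessor for cached state
end
```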

About a manifold type, I see how it would work with updates like x = x + alpha*d, and even for x .+= d (one would just need to overload broadcast for +, have it act on the data inside x and d, and retract back), but how to do it for something like x .+= alpha .* d1 .+ d2? broadcast gets passed an arbitrary function, which would be hard to analyze.

If that can be done it may not be a bad idea... there would just be a type for a point on a manifold and a type for a tangent vector (which stores its base point so it can do vector transport if needed), and the only things Optim needs to know are that it should project the gradient immediately after computing it, and that it should only do "legitimate" manifold operations (i.e. add tangent vectors to points, vecdot() tangent vectors with each other, etc.). The user would just call optimize(obj_function, gradient_function!, Stiefel(input_data)) and everything would just work.

Basically https://github.com/JuliaNLSolvers/Optim.jl/issues/399 on steroids: optimization doesn't even need a Hilbert space; it only needs a Riemannian manifold with a retraction (x + d, where x is a point on the manifold and d a tangent vector to the manifold at x), a vector transport (dx + dy, where dx is a tangent vector at x and dy a tangent vector at y), and a projection onto the tangent space.

Possibly the same interface could even be reused directly by DiffEq to solve differential equations on manifolds automatically, without callbacks. If that works, it would be pretty awesome ;-)
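
A very rough sketch of that two-type idea (all names hypothetical, with `Manifold` and `retract` as in the earlier sketch), just to show how the "legitimate manifold operations" could be enforced by dispatch:

```julia
using LinearAlgebra

# A point on a manifold and a tangent vector that remembers its base point.
struct MPoint{M<:Manifold,T}
    manifold::M
    x::T
end

struct MTangent{M<:Manifold,T}
    manifold::M
    base::T    # point whose tangent space this vector lives in
    v::T
end

# The only operations the optimizer would need:
Base.:*(a::Number, d::MTangent) = MTangent(d.manifold, d.base, a .* d.v)
# Adding tangent vectors (vector transport would be needed if the base points differ):
Base.:+(d1::MTangent, d2::MTangent) = MTangent(d1.manifold, d1.base, d1.v .+ d2.v)
# Point + tangent = take the step, then retract back onto the manifold:
Base.:+(p::MPoint, d::MTangent) = MPoint(p.manifold, retract(p.manifold, p.x .+ d.v))
# Inner product of tangent vectors (vecdot in the Julia of this thread):
LinearAlgebra.dot(d1::MTangent, d2::MTangent) = dot(d1.v, d2.v)
```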

cortner commented 7 years ago

About a manifold type, I see how it would work with updates like x = x + alpha*d, and even for x .+= d (one would just need to overload broadcast for +, have it act on the data inside x and d, and retract back), but how to do it for something like x .+= alpha .* d1 .+ d2? broadcast gets passed an arbitrary function, which would be hard to analyze

d1 and d2 are vectors in the tangent space and alpha is a scalar, so alpha * d1 + d2 is just standard vector arithmetic. x, on the other hand, is a point on the manifold, so x + d is the special addition. One just needs to distinguish points from vectors.

antoine-levitt commented 7 years ago

That's right, which is why x += alpha*d1 + d2 would be trivial to implement (overload scalar*vector, vector+vector, point+vector); my concern is with broadcast: how do you do that when it gets lowered to broadcast(some_fun, x, alpha, d1, d2)?
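
For reference, one can see the concern directly by inspecting the lowering (current Julia goes through `Base.broadcasted`/`materialize!`; the Julia of this thread produced a single `broadcast!` call, but the effect is the same):

```julia
# The fused update is lowered to an opaque kernel applied to a flat argument list,
# so the "point + scalar * tangent + tangent" structure is no longer visible
# by the time a custom array type could intercept it.
lowered = Meta.@lower x .+= alpha .* d1 .+ d2
println(lowered)
```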

cortner commented 7 years ago

Basically #399 on steroids: optimization doesn't even need a Hilbert space; it only needs a Riemannian manifold

I like that. I think we should implement a prototype in a separate branch.

cortner commented 7 years ago

That's right, which is why x += alpha*d1 + d2 would be trivial to implement (overload scalar*vector, vector+vector, point+vector); my concern is with broadcast: how do you do that when it gets lowered to broadcast(some_fun, x, alpha, d1, d2)?

I see - I misunderstood. For what it's worth I always thought doing everything in-place in those operations is overkill. But maybe I am wrong, maybe worth thinking about more carefully.

johnmyleswhite commented 7 years ago

I see - I misunderstood. For what it's worth I always thought doing everything in-place in those operations is overkill. But maybe I am wrong, maybe worth thinking about more carefully.

The original code didn't do things in-place and was noticeably slower. That was long ago in Julia history, but I'd want to see benchmarks before removing that optimization.

cortner commented 7 years ago

Then I suspect that the objectives I am used to are just significantly more costly than the ones you used for the benchmarks.

cortner commented 7 years ago

Anyhow why can't one dispatch the broadcast on the types of the arguments?

johnmyleswhite commented 7 years ago

Anyhow why can't one dispatch the broadcast on the types of the arguments?

Indeed, using multiple dispatch for optimizations is a large benefit of lowering dot-syntax to broadcast.

pkofod commented 7 years ago

The problems we have in UnconstrainedProblems are so cheap to evaluate that in-place updating makes a difference. Some of the larger problems in CUTEst take much longer to evaluate the objective and gradients than to update the current iterate. However, I do think we need to cater to both needs: I suspect Optim is used to optimize "cheap" functions repeatedly quite a lot, so removing in-place operations would be a regression there, even if only for updating the iterate.

pkofod commented 6 years ago

Thanks for this. If someone feels that some of the discussion should be continued, please open a new issue for each topic (the discussion seemed to go a bit off-topic).