JuliaNLSolvers / Optim.jl

Optimization functions for Julia

Nonlinear preconditioning / acceleration algorithms #477

Closed: anriseth closed this issue 6 years ago

anriseth commented 6 years ago

I'm interested in nonlinear (right) preconditioning for nonlinear systems and optimization, also referred to as acceleration. A discussion on how to support this would be interesting. PETSc supports this for nonlinear systems, which allows you to mix and match solvers in order to address particular features of a problem in a flexible way (Brune et al., Composing Scalable Nonlinear Algebraic Solvers). For existing literature on optimization acceleration, see https://arxiv.org/abs/1606.04133 and https://arxiv.org/abs/1106.4426.

In particular, I'm interested in implementing Nonlinear GMRES and my own variant, as they can be added on top of any of the solvers we already have, often with great benefit for high-dimensional problems. See a manuscript I recently wrote: Objective acceleration for unconstrained optimization (@cortner you mentioned that you'd be interested to have a look).

So the design and implementation options for nonlinear preconditioning are:

I can also add that @antoine-levitt implemented the Nonlinear GMRES idea for NLsolve, in the form of Anderson acceleration, as requested by @ChrisRackauckas. So the approach of enabling preconditioning for all solvers is equally relevant for NLsolve.
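To make the idea concrete, here is a minimal sketch of Anderson acceleration wrapped around an arbitrary fixed-point map `g` (think of `g` as one step of any existing solver, playing the role of the nonlinear preconditioner). The function and keyword names are illustrative only, not an existing Optim or NLsolve API:

```julia
using LinearAlgebra

# Minimal Anderson-acceleration sketch: accelerate the fixed-point iteration
# x_{k+1} = g(x_k), where g could be one step of any existing solver
# (the "nonlinear preconditioner"). Names and defaults are illustrative.
function anderson(g, x0; m = 5, iters = 50, tol = 1e-10)
    x  = copy(x0)
    Xs = typeof(x)[]                   # history of iterates x_k
    Fs = typeof(x)[]                   # history of residuals f_k = g(x_k) - x_k
    for _ in 1:iters
        gx = g(x)
        f  = gx - x
        norm(f) < tol && return x
        push!(Xs, copy(x)); push!(Fs, copy(f))
        if length(Xs) > m
            popfirst!(Xs); popfirst!(Fs)
        end
        n = length(Xs)
        if n == 1
            x = gx                     # no history yet: plain fixed-point step
        else
            # Least-squares combination of the stored residuals (difference form)
            dF = hcat([Fs[i+1] - Fs[i] for i in 1:n-1]...)
            dG = hcat([Xs[i+1] - Xs[i] + Fs[i+1] - Fs[i] for i in 1:n-1]...)
            γ  = dF \ f
            x  = gx - dG * γ
        end
    end
    return x
end

# Example: accelerate a damped gradient-descent map for f(x) = ½x'Ax - b'x.
A = [3.0 1.0; 1.0 2.0]; b = [1.0, 1.0]
g(x) = x - 0.2 * (A * x - b)
anderson(g, zeros(2)) ≈ A \ b          # true: converges to the minimizer A \ b
```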

ChrisRackauckas commented 6 years ago

I think it should be algorithm-generic. It shouldn't be too hard.

And reading https://arxiv.org/pdf/1106.4426.pdf, it should be generalized a bit more. They essentially just have the right preconditioner, but no left preconditioner. It should be a four-step algorithm:

1) Right precondition to get ubar
2) GMRES step to get uhat
3) Potential update to utilde
4) Left precondition to get u(n+1)

This is not the same as right preconditioning twice, because GMRES doesn't internally use a vector of right-preconditioned values, but it would use the left-preconditioned values. Essentially what you're doing is taking a step, and tagging on a cheap gradient-descent step to nudge it a little more downhill each iteration. Given the importance of this difference in linear solvers, I am surprised it's not mentioned for nonlinear solvers (that I know of).
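A rough, runnable toy version of that four-step composition on a quadratic f(u) = ½u'Au - b'u. Every function name and step choice below is a placeholder, not anything from Optim: the "right preconditioner" is a fixed-step gradient-descent step, the "GMRES step" is a crude extrapolation, and the "left preconditioner" is a second, cheaper gradient-descent nudge.

```julia
using LinearAlgebra

function composed_demo(A, b; iters = 50)
    grad(u) = A * u - b
    right_precondition(u) = u - 0.2 * grad(u)     # 1) ubar: one inner-solver step
    accelerate(ubar, hist) =                      # 2) uhat: stand-in for the GMRES step
        length(hist) < 2 ? ubar : ubar + 0.5 * (ubar - hist[end-1])
    safeguard(uhat, ubar) =                       # 3) utilde: keep uhat only if it helps
        norm(grad(uhat)) <= norm(grad(ubar)) ? uhat : ubar
    left_precondition(u) = u - 0.05 * grad(u)     # 4) u(n+1): cheap extra nudge downhill

    u, hist = zero(b), [zero(b)]
    for _ in 1:iters
        ubar   = right_precondition(u)
        uhat   = accelerate(ubar, hist)
        utilde = safeguard(uhat, ubar)
        u      = left_precondition(utilde)
        push!(hist, copy(u))
    end
    return u
end

A = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]
composed_demo(A, b) ≈ A \ b    # true: the composed iteration reaches the minimizer
```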

But yes, this would definitely be something interesting to play with, and if implemented generically could lead to benchmarking papers showing how much acceleration helps/hurts different methods.

antoine-levitt commented 6 years ago

The following are not clear to me:

- is acceleration useful for any method other than steepest descent? CG and bfgs variants already combine past iterates, so I would not expect acceleration to be useful. For steepest descent, is it better than l-bfgs?

There's an infinite number of variants and combinations of standard approaches that can be thought up for optimization problems. The problem with nonlinear optimization is that, well, it's nonlinear, and very similar algorithms can have wildly different behaviors on different problems (even for non-symmetric linear systems it's already a mess...). While in general more methods are always good, it does lead to feature creep that makes the API more complicated and creates a burden on future development, so I'd be cautious of implementing methods unless they have stood the test of time or are simple enough that they don't have a major impact on the code.

anriseth commented 6 years ago

is acceleration useful for any method other than steepest descent?

I have used N-GMRES successfully with Newton on nonlinear problems arising from PDEs.

The problem with nonlinear optimization is that, well, it's nonlinear, and very similar algorithms can have wildly different behaviors on different problems (even for non-symmetric linear systems it's already a mess...)

Some of these branching-type methods can even result in very different behaviour across computer systems; I have experienced this myself. Standard N-GMRES resets its history whenever it proposes a bad direction, and I had two systems where one reset and the other did not, due to floating-point differences...
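For concreteness, the kind of branch that can flip between machines is a sufficient-descent test like the sketch below (illustrative only, not the exact N-GMRES criterion):

```julia
using LinearAlgebra

# Illustrative restart criterion: drop the acceleration history if the proposed
# direction d is not a sufficient descent direction for the gradient g. When
# dot(g, d) sits near the threshold, rounding differences between machines or
# BLAS builds can flip the branch, so one run resets and another does not.
function maybe_reset!(history, g, d; η = 1e-12)
    if dot(g, d) > -η * norm(g) * norm(d)   # not enough descent: restart from scratch
        empty!(history)
        return true
    end
    return false
end
```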

While in general more methods are always good, it does lead to feature creep that makes the API more complicated and creates a burden on future development, so I'd be cautious of implementing methods unless they have stood the test of time or are simple enough that they don't have a major impact on the code.

Do you think an extra line calling a nonlinear preconditioner (default = nothing) can be done without causing large API complications?

antoine-levitt commented 6 years ago

That's very interesting. It's in http://www.pefarrell.org/wp-content/uploads/2015/09/riseth2015.pdf, correct? That seems nuts to me, especially if, as you write, Newton is in its quadratic convergence regime. You report a speedup of 10%, which does not seem that significant; are you sure it's not just noise?

I have a colleague that implemented Anderson (DIIS) on top of Broyden and reported a very large acceleration, but I'm not sure if that's real or an artefact of a bad Broyden implementation. It's very hard to say anything for certain in this domain...

ChrisRackauckas commented 6 years ago

Do you think an extra line calling a nonlinear preconditioner (default = nothing) can be done without causing large API complications?

Yes
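For example, something along these lines, where `nlprecon` and `NGMRES` are hypothetical names (not part of Optim's current API), and which object wraps which is exactly the design question to settle:

```julia
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
x0 = zeros(2)

# Today: a plain solver call.
optimize(rosenbrock, x0, GradientDescent())

# Hypothetical: the same call with a nonlinear preconditioner / accelerator
# attached through a single extra argument (both spellings are sketches only):
# optimize(rosenbrock, x0, NGMRES(nlprecon = GradientDescent()))   # accelerator wraps a base solver
# optimize(rosenbrock, x0, GradientDescent(); nlprecon = nothing)  # or: every solver takes the keyword
```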

is acceleration useful for any method other than steepest descent? CG and bfgs variants already combine past iterates, so I would not expect acceleration to be useful. For steepest descent, is it better than l-bfgs?

There's an infinite number of variants and combinations of standard approaches that can be thought up for optimization problems. The problem with nonlinear optimization is that, well, it's nonlinear, and very similar algorithms can have wildly different behaviors on different problems (even for non-symmetric linear systems it's already a mess...). While in general more methods are always good, it does lead to feature creep that makes the API more complicated and creates a burden on future development, so I'd be cautious of implementing methods unless they have stood the test of time or are simple enough that they don't have a major impact on the code.

That's very interesting. It's in http://www.pefarrell.org/wp-content/uploads/2015/09/riseth2015.pdf, correct? That seems nuts to me, especially if, as you write, Newton is in its quadratic convergence regime. You report a speedup of 10%, which does not seem that significant; are you sure it's not just noise?

I have a colleague that implemented Anderson (DIIS) on top of Broyden and reported a very large acceleration, but I'm not sure if that's real or an artefact of a bad Broyden implementation. It's very hard to say anything for certain in this domain...

If the general logic is separated from the implementation of a specific method, then adding a pile of methods doesn't really make maintenance more difficult. There is an issue with documentation at that point, but I like libraries that aren't just copying well-known methods but are also fully featured enough to be a research tool themselves. If you implement the preconditioners so they work with everything, then those questions you posed could be studied in detail using Optim instead of specialized implementations. IMO that's how Julia packages will get an edge.
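A small sketch of that separation (every name here is made up for illustration): the acceleration logic sees the inner method only through a generic one-step function, so any solver can be plugged in without the wrapper knowing its internals.

```julia
struct Accelerated{F,C}
    innerstep::F   # x -> one step of any inner method (the "nonlinear preconditioner")
    combine::C     # (history, xbar) -> accelerated iterate
    memory::Int    # how many past iterates the accelerator may keep
end

function solve(acc::Accelerated, x0; iters = 100)
    x, history = copy(x0), [copy(x0)]
    for _ in 1:iters
        xbar = acc.innerstep(x)               # inner-solver step, method-agnostic
        x    = acc.combine(history, xbar)     # accelerator-specific combination
        push!(history, copy(x))
        length(history) > acc.memory && popfirst!(history)
    end
    return x
end

# Example: a damped gradient step on f(x) = ½‖x‖² with a trivial (identity) combiner.
acc = Accelerated(x -> 0.5 .* x, (hist, xbar) -> xbar, 5)
solve(acc, [1.0, -2.0])   # approaches the minimizer at the origin
```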

anriseth commented 6 years ago

That's very interesting. It's in http://www.pefarrell.org/wp-content/uploads/2015/09/riseth2015.pdf, correct?

This spring I ran further tests on harder problems using other academic reservoir simulators, with more significant improvements. For example, a reduction from 770 nonlinear solves to 543 over the course of a simulation (i.e. aggregated over multiple timesteps). Some of this is due to the simulator halving time-steps when (non-preconditioned) Newton "fails" (more than 40 steps without solving the problem).

That seems nuts to me, especially if, as you write, Newton is in its quadratic convergence regime. You report a speedup of 10%, which does not seem that significant; are you sure it's not just noise?

I believe the reduction is due to two things: a slight nudge that puts a particular solve below the tolerance in one fewer iteration, and stabilisation of the time-steps in the (admittedly very few) parts of a simulation where Newton is not in the quadratic regime. The results in the report come from four different simulations, which each need to solve 100-2000 nonlinear problems/time steps, so I hope the number of nonlinear problems solved reduces the "noise" (?).

It's very hard to say anything for certain in this domain...

I agree. However, it seems to be a simple way to improve existing implementations :)

pkofod commented 6 years ago

[...] or an artefact of a bad Broyden implementation. It's very hard to say anything for certain in this domain...

This is indeed a very real challenge. Many papers fail to provide enough detail for you to expect the XYZ method in five different packages to give the same trace and result. Corner cases, handling of ill-conditioned problems, and so on all make this very difficult to speak about precisely.

This is also why it's important for code to be quite clear and well-documented. I don't think Optim is quite there yet (or there at all), so needless to say, I'd expect a fair level of detail when it comes to documentation, comments, and design in new stuff! I bothered @antoine-levitt with these things in his Manifold PR, but it's a very real problem that maintenance can become an issue (as mentioned by @ChrisRackauckas) when we're simply not a team of five maintaining a commercial piece of software that we're paid to produce.

That said, I'm confident that @anriseth will be able to deliver in all of these areas. I should maybe clarify this somewhere, but feature additions are fine as long as they are documented, commented, somewhat neatly/compactly designed, tested, and useful on some level.