JuliaNLSolvers / Optim.jl

Optimization functions for Julia

API/Examples of nested optimization? #551

Closed: anriseth closed this issue 5 years ago

anriseth commented 6 years ago

Mentioned by @ahwillia in the JOSS review.

Is there a nice API to do a nested optimization? Let's say I wanted to use rand_search to tune a few hyperparameters of a complex model. For each set of hyperparameters I used a more sophisticated method (e.g. L-BFGS). That would be a super rad use case of this package!

So this seems like a use case that can be relevant for statistical/machine learning.

If anybody wants to have a go at providing an example/documentation/necessary code to do nested optimization, that's much appreciated :)

pkofod commented 6 years ago

The only thing I'm a bit curious about is how you would pass the first set of parameters in a generic fashion. The simplest way I can think of to do right now would be something like:

https://github.com/pkofod/Misc/blob/master/Untitled38.ipynb

Basically, I use SimulatedAnnealing to find a starting point for BFGS that gives me the lowest objective value. If I start BFGS at the minimizer returned by SA, I get exactly 0.0 as the minimum, but if I start BFGS at the point where I started SimulatedAnnealing, it's "only" around 1e-25... What @ahwillia is probably interested in is something like

f_opt(hyperpar) = Optim.minimum(optimize(x->f(x; hyperparameters=hyperpar), inner_x0, ...))
optimize(f_opt, lb, ub, RandSearch())

I think that it might actually be easy enough that we don't need a specific interface for it, but we do need a tutorial for it (I just need a better example where we're optimizing over some hyperparameters on the outside).
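
To make that pattern concrete, here is a minimal, self-contained sketch (hypothetical data and a ridge-type inner problem, not from the notebook above; since RandSearch is only a placeholder name, a plain grid over the hyperparameter stands in for the outer search):

using Optim

# Hypothetical setup: tune a ridge penalty λ by nested optimization.
Atrain, btrain = randn(50, 5), randn(50)
Aval,   bval   = randn(20, 5), randn(20)
inner_x0 = zeros(5)

# Inner objective: penalized least squares for a given hyperparameter λ.
g(x, λ) = 0.5 * sum(abs2, Atrain * x - btrain) + λ * sum(abs2, x)

# Outer objective: validation loss at the inner minimizer, found with L-BFGS.
# (Using a validation loss rather than Optim.minimum keeps the outer search
# from trivially picking the smallest λ.)
function f_opt(λ)
    res = optimize(x -> g(x, λ), inner_x0, LBFGS())
    x̂ = Optim.minimizer(res)
    return 0.5 * sum(abs2, Aval * x̂ - bval)
end

# Outer search: a simple grid over λ stands in for random search.
λs = 10.0 .^ range(-4, 1, length = 20)
λbest = λs[argmin(f_opt.(λs))]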

antoine-levitt commented 6 years ago

I think that it might actually be easy enough that we don't need a specific interface for it

Agreed. A nice thing about Julia libraries is easy composability, so I don't think any specific development is needed. An interesting bit of trivia about nested optimization that is usually very useful, and might be good to point out in a tutorial, is that the gradient of the outer objective function is readily available: if f(x) = min_y g(x,y), then f'(x) = ∂g/∂x + ∂g/∂y dy/dx = ∂g/∂x (if the inner optimizer is well-converged so that ∂g/∂y = 0). This is much quicker than using finite differences or autodiff (which would require differentiating the minimization process).
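
As a quick numerical check of that claim (a toy example, not from the discussion): with g(x, y) = (y - sin(x))^2 + x^2 the inner minimizer is y(x) = sin(x), so f(x) = x^2 and f'(x) = 2x, which is exactly ∂g/∂x at the inner optimum:

using Optim

g(x, y) = (y - sin(x))^2 + x^2

# Outer objective: minimize over y for a fixed x.
f(x) = Optim.minimum(optimize(y -> g(x, y[1]), [0.0], LBFGS()))

x, h = 0.7, 1e-6
fd = (f(x + h) - f(x - h)) / 2h                     # finite-difference f'(x)

ŷ = Optim.minimizer(optimize(y -> g(x, y[1]), [0.0], LBFGS()))[1]
dgdx = -2 * (ŷ - sin(x)) * cos(x) + 2x              # ∂g/∂x at the inner minimizer

# fd ≈ dgdx ≈ 2x: the ∂g/∂y * dy/dx term vanishes because y solves the inner problem.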

mohamed82008 commented 6 years ago

This is for unconstrained smooth minimization.

pkofod commented 6 years ago

This is for unconstrained smooth minimization.

What do you mean?

Edit: oh you were referring to Antoine's point?

antoine-levitt commented 6 years ago

Smooth yes, unconstrained no: even if the minimization is constrained on y, the statement holds (∂g/∂y is non-zero, but dy/dx is in its kernel)

pkofod commented 6 years ago

An interesting bit of trivia about nested optimization that is usually very useful, and might be good to point out in a tutorial, is that the gradient of the outer objective function is readily available: if f(x) = min_y g(x,y), then f'(x) = ∂g/∂x + ∂g/∂y dy/dx = ∂g/∂x (if the inner optimizer is well-converged so that ∂g/∂y = 0). This is much quicker than using finite differences or autodiff (which would require differentiating the minimization process).

Will remember to point that out.

mohamed82008 commented 6 years ago

(∂g/∂y is non-zero, but dy/dx is in its kernel)

Could you elaborate? dy/dx is not explicit, since it depends on how x affects the optimal solution. Edit: unless one can find an explicit expression for the optimal y as a function of x, which is unlikely; otherwise one would be optimizing that function directly instead of doing nested optimization.

antoine-levitt commented 6 years ago

That's right, but the beauty of it is that it doesn't matter. The statement is that ∂g/∂y is zero along any admissible direction for y (if it weren't, you could decrease the objective by going a bit in that direction). Since dy/dx is an admissible direction, that term is zero. You can play the same trick with Lagrange multipliers if you prefer that formalism; see for instance the Wikipedia page for the Hellmann-Feynman theorem, applied to the minimization of <y, H(x) y> s.t. ||y|| = 1, where H is a Hermitian matrix depending on x. (Of course, this is all assuming that the minimization problem is non-degenerate, so that by the implicit function theorem y(x) is smooth.)
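
A short sketch of that eigenvalue example (with a toy H(x) chosen purely for illustration): f(x) = min_{||y||=1} <y, H(x) y> is the smallest eigenvalue of H(x), and the Hellmann-Feynman derivative f'(x) = <y, H'(x) y> at the minimizing eigenvector can be checked against finite differences:

using LinearAlgebra

H(x)  = [2.0+x  x^2; x^2  4.0-x]      # toy symmetric matrix depending on x
dH(x) = [1.0  2x; 2x  -1.0]           # its elementwise derivative in x

x = 0.3
y = eigen(Symmetric(H(x))).vectors[:, 1]   # eigenvector of the smallest eigenvalue
hf = y' * dH(x) * y                        # Hellmann-Feynman derivative

h = 1e-6
fd = (eigmin(Symmetric(H(x + h))) - eigmin(Symmetric(H(x - h)))) / 2h
# hf ≈ fd, without differentiating the eigensolver itself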

anriseth commented 6 years ago

(∂g/∂y is non-zero, but dy/dx is in its kernel)

I have an example which does not seem to adhere to this (where am I going wrong?): g(x,y) = (x-y)^2 with constraint y >= 2x. Then y(x) = 2x and ∂g/∂y = -2(x-y), but dy/dx = 2, which is not in the kernel of ∂g/∂y|_{y=2x} = 2x.

antoine-levitt commented 6 years ago

Yes, sorry, I'm assuming the constraint does not depend on x (if it does, I think the calculation can be adapted; you just have to include the Lagrange multiplier of the constraint).

mohamed82008 commented 6 years ago

If the solution space doesn't change with x, that makes sense, but if it does, I doubt this can be done directly. Benders decomposition, for instance, handles LPs where linear constraints depend on separable variables, which is a special case of the problem described here.

antoine-levitt commented 6 years ago

Unless I'm mistaken that still works. Assume a single constraint h(x,y) = 0, then at the minimum dg/dy + λ dh/dy = 0, so dg/dy dy/dx = - λ dh/dy dy/dx = + λ dh/dx, so you just have to compute the derivative of the constraint with respect to x at the minimum, and the Lagrange multiplier. I think this is simply the interpretation of Lagrange multipliers as the marginal cost of a constraint.

mohamed82008 commented 6 years ago

Hmm, lambda itself could be a function of x!

mohamed82008 commented 6 years ago

Actually you never assumed lambda was constant, my bad. I think that makes sense.

Edit: This is neat! I am not a nested optimization guy (yet!), so this was kind of surprising. In the general case, one would consider all active constraints and their respective multipliers. One assumption made here is that the optimal y is a KKT point; a sufficient condition for this is that constraint qualifications are satisfied, which I believe is a reasonable assumption in most cases, unless the objective and constraints are nasty by design, which would be unfortunate!

It is also worth mentioning that the solver must report the Lagrange multipliers, or this won't work!

antoine-levitt commented 6 years ago

Of course it does, that's the point of Lagrange multipliers :p That's actually a slightly generalized version of the last equation in https://en.wikipedia.org/wiki/Lagrange_multiplier#Interpretation_of_the_Lagrange_multipliers ("a Lagrange multiplier is the change in the optimal value of the objective function (profit) due to the relaxation of a given constraint").

ahwillia commented 6 years ago

Great discussion! I just want to point one thing out briefly:

the gradient of the outer objective function is readily available

In the applications I'm thinking about, this gradient either doesn't exist or is too expensive to compute: for example, a matrix decomposition problem, or a problem with a discrete number of latent states/clusters. In these cases the hyperparameter is a discrete variable.

Another use case is fitting a deep net via stochastic gradient descent with a fixed learning rate (or even better, a learning rate that decreases according to some schedule). In these cases, it would be very expensive to compute/approximate the necessary gradient.

The second example is maybe outside the scope of what Optim is interested in... But maybe there are others?

mohamed82008 commented 6 years ago

Unless I'm mistaken that still works. Assume a single constraint h(x,y) = 0, then at the minimum dg/dy + λ dh/dy = 0, so dg/dy dy/dx = - λ dh/dy dy/dx = + λ dh/dx

On a second, less sleepy thought, the above derivation is slightly off. In the lower-level problem, the goal is to minimize g(x,y) with respect to y given x. If the optimal y* is a KKT point, the stationarity condition -∂g/∂y + λ ∂h/∂y = 0 is satisfied there.

The goal of the higher-level optimization problem is to minimize g(x,y(x)), where y(x) is the implicit function that returns the optimal solution y of the lower-level problem given x. The full derivative of g with respect to x is ∂g/∂x + ∂g/∂y dy/dx. Since y(x) is not explicit, one would ideally want to eliminate dy/dx, or find it without going through y(x). Since at the optimal solution of the lower-level problem y = y(x) and g(x,y) = g(x,y(x)), let's overload y now and treat it as a function in the higher-level problem. The relationship -∂g/∂y + λ ∂h/∂y = 0 still holds.

Multiplying by dy/dx: -∂g/∂y dy/dx + λ ∂h/∂y dy/dx = 0. Adding suitably weighted ∂g/∂x and ∂h/∂x on both sides: -(∂g/∂x + ∂g/∂y dy/dx) + λ (∂h/∂x + ∂h/∂y dy/dx) = -∂g/∂x + λ ∂h/∂x. This reduces to -dg/dx + λ dh/dx = -∂g/∂x + λ ∂h/∂x. So we have a relationship between dg/dx and dh/dx, both of which are unknown. This means we need another equation to help us find both, since we have no way of computing the full derivative of the constraint or the objective unless it is a function of x only, which wouldn't be a nested problem anymore.

A second relation, an inequality this time, comes from knowing that the solution space does not change with respect to x. Assuming the lower-level optimization algorithm guarantees that the directional derivative of g with respect to y is non-negative in every feasible direction, i.e. ∂g/∂y dy/dx >= 0 (intuitively, an infinitesimal change in x can only infinitesimally change y in a feasible direction, which does not decrease g given x), we get dg/dx >= ∂g/∂x. That is, we get a lower bound on the full derivative.

So from -dg/dx + λ dh/dx = -∂g/∂x + λ ∂h/∂x, we can rearrange to dg/dx = ∂g/∂x + λ (dh/dx - ∂h/∂x) >= ∂g/∂x, and therefore λ (dh/dx - ∂h/∂x) >= 0, which is the same as λ ∂h/∂y dy/dx >= 0.

Did I go wrong somewhere?

mohamed82008 commented 6 years ago

The second example is maybe outside the scope of what Optim is interested in... But maybe there are others?

I don't see what stops you from defining the parameters of the lower-level optimizer as variables in the higher-level function. The only thing is that you have to stick to zeroth-order optimization methods for the higher-level problem, use autodiff through the optimization algorithm, or find an explicit form of the derivative with respect to the hyperparameter, as in this paper: https://arxiv.org/abs/1703.04782.

mohamed82008 commented 6 years ago

The only case I think Optim will have a hard time handling is the discrete parameter case, since Optim only does continuous optimization.

pkofod commented 6 years ago

In the case that was the motivation for this issue, we could support discrete values as long as there's a set of discrete numbers we can sample from. Can you reference some simple tuning problems we could try, @ahwillia?

antoine-levitt commented 6 years ago

@mohamed82008 I was too quick, and partial derivatives are awkward to type, but you have to use d/dx h(x, y(x)) = 0 (because h(x,y(x)) = 0 for all x), and so ∂h/∂x + ∂h/∂y dy/dx = 0. The end formula only has ∂h/∂x in it, which is easy to compute. In the special case where the constraint is of the form h(y)=x, you get the interpretation of Lagrange multipliers as marginal costs.
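
Spelling out the chain, and checking it on @anriseth's example from above (h(x,y) = 2x - y <= 0, active for x > 0, so that y(x) = 2x and f(x) = x^2):

∂g/∂y + λ ∂h/∂y = 0 at the inner optimum, and ∂h/∂y dy/dx = -∂h/∂x from d/dx h(x, y(x)) = 0, so ∂g/∂y dy/dx = λ ∂h/∂x and f'(x) = ∂g/∂x + λ ∂h/∂x.

For the example: ∂g/∂y + λ ∂h/∂y = 2(y - x) - λ = 0 gives λ = 2x >= 0, and ∂g/∂x + λ ∂h/∂x = 2(x - y) + 2λ = -2x + 4x = 2x, which matches d/dx (x^2).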

@ahwillia Discrete optimization and stochastic optimization are indeed tricky; the general idea, to which I'm yet to see a counter-example, is that whenever a gradient really makes sense it can be computed efficiently, even for complicated functions with other loops inside.

mohamed82008 commented 6 years ago

@antoine-levitt Interesting! That's true for equality constraints, though. For active inequality constraints of the form h(x,y) <= 0, the best guarantee we have in the top-level problem is that h(x,y(x)) can only decrease as x changes, so dh/dx <= 0, unless one can prove that the constraint will remain active as x changes, in which case it is equivalent to the equality version.

However, by complementary slackness, λ(x) h(x,y(x)) = 0 -> dλ/dx h(x,y(x)) + λ(x) dh/dx = 0 -> λ(x) dh/dx = -dλ/dx h(x,y(x)). Multiplying by λ(x): (λ(x))^2 dh/dx = -dλ/dx λ(x) h(x,y(x)) = 0. Therefore, for any x, either λ(x) = 0 or dh/dx = 0. So one can still eliminate the second term in -dg/dx + λ dh/dx = -∂g/∂x + λ ∂h/∂x -> -dg/dx = -∂g/∂x + λ ∂h/∂x.

The only missing, but not so important, piece I think is the case λ = 0: I am not sure how to find dh/dx then. I think the best guarantee we have is that dh/dx <= 0. Not that it matters much anyway 😆. This discussion is very useful for my work, so thanks a lot!!!
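
On @anriseth's earlier example (h(x,y) = 2x - y <= 0), this is visible directly: for x > 0 the constraint stays active, so h(x, y(x)) = 2x - 2x = 0 identically and dh/dx = 0, while λ(x) = 2x > 0, consistent with the either/or above.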

antoine-levitt commented 6 years ago

I'm very rusty on inequality-constrained optimization, but the only problematic part, as you say, is when an active constraint becomes inactive as x moves a bit. In this case it's not obvious that everything stays smooth and the gradient is well-defined (probably this can be ensured with the appropriate qualification conditions, or maybe the function is not differentiable and one has to look at subgradients?), but if it does, the result (gradient = ∂g/∂x) is valid by continuity.

ahwillia commented 6 years ago

the general idea which I'm yet to see a counter-example of is that whenever a gradient really makes sense it can be computed efficiently

Often in hyperparameter optimization I don't think it makes sense to use gradients, even if you can compute them. Since you are operating in a relatively low-dimensional space, possibly with a very nonconvex function, it makes sense to do a more exploratory/global optimization than a local gradient-based approach.

Can you reference some simple tuning problems we could try, @ahwillia?

A very, very simple one would be LASSO regression where one does a 1D grid search over the strength of the penalty parameter. Since the problem is convex you can use warm starts for each inner loop. If you want to extend this to a 2D example you could try elastic net regression.

To add a discrete variable, you could try something like sparse PCA (which is basically a LASSO penalty on a matrix factorization). In that case you would be tuning a penalty parameter as well as the discrete number of hidden states/components. I don't think the Eckart-Young theorem applies to sparse PCA, so you have to do hyperparameter search over the number of components/clusters in the model.
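
A hypothetical sketch of the first suggestion (the warm-started 1D grid search): the L1 penalty is replaced here by a smooth approximation so that L-BFGS applies, whereas a dedicated LASSO solver would be the better tool in practice:

using Optim

smooth_abs(t) = sqrt(t^2 + 1e-6)                      # smooth stand-in for |t|
lasso_obj(x, λ, A, b) = 0.5 * sum(abs2, A * x - b) + λ * sum(smooth_abs, x)

function lasso_path(A, b, λs)
    x0   = zeros(size(A, 2))
    fits = similar(λs)
    for (i, λ) in enumerate(λs)
        res     = optimize(x -> lasso_obj(x, λ, A, b), x0, LBFGS())
        x0      = Optim.minimizer(res)                # warm start the next (weaker) penalty
        fits[i] = Optim.minimum(res)
    end
    return fits
end

A, b = randn(100, 10), randn(100)
λs   = 10.0 .^ range(1, -3, length = 30)              # strong to weak penalty
fits = lasso_path(A, b, λs)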

pkofod commented 5 years ago

I'm going to close this in favor of the rand search one.