JuliaNLSolvers / Optim.jl

Optimization functions for Julia

LBFGS first step too large #919

Closed · knuesel closed this issue 6 months ago

knuesel commented 3 years ago

I got a false convergence result when minimizing -logpdf(Normal(150, σ), 147.9) with L-BFGS. My σ is defined on [0, 50], so Turing.jl maps it to (-∞, ∞) with a logit/sigmoid transformation through Bijectors.jl. See the original issue at https://github.com/TuringLang/Turing.jl/issues/1597 for details.

Here is a reduced example with σ in [0, 5]:

using Distributions, Optim, Plots

default(legend=false, size=(400, 250), titlefont=font(11))

# Function to minimize for σ ∈ [0, 5]
mloglikelihood(σ) = -logpdf(Normal(150, σ), 147.9)

# Plot in original coordinates
σ = 0.4:0.01:5
plot(σ, mloglikelihood.(σ), xlabel="σ", title="Function to minimize on [0, 5]") |> display

# Map σ ∈ [0, 5] to x ∈ (-∞, ∞) using logit
link(σ) = -log(5/σ - 1)

# Inverse map of x ∈ (-∞, ∞) to σ ∈ [0, 5] using sigmoid
invlink(x) = 5/(1+exp(-x))

# Function to minimize for x ∈ (-∞, ∞)
f(x) = mloglikelihood(invlink(x[1]))

# Initial value: σ = 0.4  ↦  x ≈ -2.44
initial = [link(0.4)]

# Optimize on (-∞, ∞)
opt = optimize(f, initial, LBFGS(), Optim.Options(show_trace=true, extended_trace=true, store_trace=true))

# Result: x = 21.99515296773123  ↦  σ = 4.999999998598489
invlink(opt.minimizer[1])

# Plot steps
x = -2.44:0.1:22
plot(x, f.(x), xlabel="link(σ)", title="Function to minimize on (-∞, ∞)")
scatter!([(step.metadata["x"][1], step.value) for step in opt.trace])

Output:

[Plot: the function to minimize on σ ∈ [0, 5]]

Iter     Function value   Gradient norm 
     0     1.378390e+01     2.443750e+01
 * Current step size: 1.0
 * time: 6.008148193359375e-5
 * g(x): [-24.4375000033305]
 * x: [-2.4423470353692043]
     1     2.616576e+00     2.300624e-10
 * Current step size: 0.9999999999905856
 * time: 0.0002739429473876953
 * g(x): [2.300623668166915e-10]
 * x: [21.99515296773123]

[Plot: the function to minimize on (-∞, ∞), with red markers at the optimization steps]

The red markers show the initial value and the first (and last) iteration. As you can see, the large initial gradient (-24.4) sends the first step far into the flat region on the right, where the |g(x)| convergence condition is immediately satisfied.

Of course the user can fix this by choosing a better initial value or by decreasing the step size (e.g. with alphaguess=0.01). However, it was not obvious in my case, since the coordinate transformation and algorithm selection were made automatically behind the scenes. So I guess it would be nice to make the algorithm more robust to this issue... Here's an idea:

It seems unlikely that the optimization legitimately succeeds in a single step. So if that happens, Optim could try again with alphaguess divided by 10 (repeating a few times before giving up), roughly as in the sketch below.
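Here is a rough user-side sketch of that retry heuristic (a hypothetical helper, not existing Optim functionality; it assumes the default initial step-length guess of 1.0 and restarts with a smaller LineSearches.InitialStatic):

using Optim, LineSearches

# Hypothetical wrapper: if L-BFGS "converges" after at most one iteration,
# retry with the initial step-length guess divided by 10, a few times.
function optimize_with_retry(f, x0; max_retries=3)
    alpha = 1.0
    res = optimize(f, x0, LBFGS(alphaguess=LineSearches.InitialStatic(alpha=alpha)))
    retries = 0
    while Optim.iterations(res) <= 1 && retries < max_retries
        alpha /= 10
        res = optimize(f, x0, LBFGS(alphaguess=LineSearches.InitialStatic(alpha=alpha)))
        retries += 1
    end
    return res
end

optimize_with_retry(f, initial)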

Or do you think such logic should rather be added to downstream packages like Turing.jl?

pkofod commented 3 years ago

I don't know, it's hard to do something general here. If you think link(sigma) == 20 is too large, maybe the right thing to do is to apply bounds? I know it's sometimes hard to know a priori, but... It's also hard for Optim to know whether it actually jumped to a local minimum. You mention one iteration, but what about two? And if you started extremely close to the optimum, you would simply end up doing two iterations instead of one... Functions with flat regions are hard for local optimizers. Maybe the appropriate thing would be to start from a few different starting values?
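For example, with bounds one could skip the logit transform and run box-constrained L-BFGS directly on σ via Fminbox (a minimal sketch reusing mloglikelihood from the example above; the lower bound 0.01 just keeps σ away from 0):

using Optim

# Box-constrained minimization directly over σ ∈ [0.01, 5];
# Fminbox wraps the inner optimizer in a barrier method.
lower, upper, σ0 = [0.01], [5.0], [0.4]
res = optimize(σ -> mloglikelihood(σ[1]), lower, upper, σ0, Fminbox(LBFGS()))
Optim.minimizer(res)  # should recover σ ≈ 2.1 (= |147.9 - 150|), the true minimizer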

ChrisRackauckas commented 3 years ago

Isn't the solution here to just use LBFGS(initial_stepnorm = 0.01) or whatever else is small enough?

pkofod commented 3 years ago

Isn't the solution here to just use LBFGS(initial_stepnorm = 0.01) or whatever else is small enough?

Yes, but that option is only available for BFGS right now. But I agree, as I hope was clear above: I don't think a heuristic will be easy to find or generally applicable. A local gradient-based method will sometimes just have a hard time with functions shaped like the one above. There's a reason why so many of the canonical optimization test problems are valleys or steep edges combined with almost flat regions.
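For reference, a minimal sketch of that BFGS workaround (initial_stepnorm scales the initial identity approximation of the inverse Hessian, which shrinks the first step; f and initial are from the example above):

using Optim

# BFGS with a small initial step norm: the first quasi-Newton step is
# scaled down instead of jumping straight into the flat region.
res = optimize(f, initial, BFGS(initial_stepnorm=0.01))
invlink(Optim.minimizer(res)[1])  # should end up near σ ≈ 2.1 rather than ≈ 5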

knuesel commented 3 years ago

Yes, in practice I solve it with a small alphaguess and by trying different starting values, roughly as in the sketch below. I had the vague idea that "convergence in one iteration" might be a good heuristic to detect that the initial step was too large, but I don't have much perspective on this. Feel free to close the issue.
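A sketch of that workaround, assuming the LineSearches.InitialStatic initial step-length guess and a few hand-picked starting points (f, link and invlink as defined above):

using Optim, LineSearches

# Small fixed initial step-length guess for the line search,
# tried from several starting values of σ; keep the best result.
method = LBFGS(alphaguess=LineSearches.InitialStatic(alpha=0.01))
starts = [0.4, 1.0, 2.0, 4.0]
results = [optimize(f, [link(σ0)], method) for σ0 in starts]
best = argmin(Optim.minimum, results)
invlink(Optim.minimizer(best)[1])  # should be ≈ 2.1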