JuliaNLSolvers / Optim.jl

Optimization functions for Julia

How to configure Optim.jl's LBFGS to have similar parameters to scipy.optimize's L-BFGS-B? #922

Open rht opened 3 years ago

rht commented 3 years ago

I'm @Rein Zustand on julialang.zulipchat.com, where I asked a question on this topic; see @pkofod's answer at https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-.28published.29/topic/why.20is.20Optim.2Ejl.20so.20slow.3F/near/227892581. The difference is that this time I have permission to publish a subset of my model on GitHub, for concrete context: https://github.com/rht/climate_stress_test_benchmark (the code is ~300 LOC in Julia / Python).

I have also asked at Julia Discourse for insights (see https://discourse.julialang.org/t/optim-jl-vs-scipy-optimize-once-again/61661).

The benchmark result:

  1. Python objective function, scipy.optimize with L-BFGS-B, ftol=2.220446049250313e-09 (default), gtol=1e-5 (default), maxls=20 (default): 0.448 ± 0.013 s
  2. Julia objective function, but using scipy.optimize with the same params as the previous point: 0.014093 s, which is 31.8x faster than the full Python version
  3. Julia objective function, Optim.jl with LBFGS, f_tol=2.2e-9, g_tol=1e-5, HagerZhang with linesearchmax=20 (those params are explicitly set): 700.513 ms (3365 allocations: 148.81 KiB) using @btime

The remaining difference between Optim.jl and scipy.optimize seems to be that Optim.jl uses the HagerZhang line search, which is probably inherently slower than scipy.optimize's dcsrch, extracted from Minpack2 with some modifications (the line search routine is defined at https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L3326). A comment in that file (https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L2439-L2480):

c     This subroutine calls subroutine dcsrch from the Minpack2 library
c       to perform the line search.  Subroutine dscrch is safeguarded so
c       that all trial points lie within the feasible region.
c
c     Be mindful that the dcsrch subroutine being called is a copy in
c       this file (lbfgsb.f) and NOT in the Minpack2 copy distributed
c       by scipy.

Which line search algorithm from LineSearches.jl should I use, and with what params, so that it is most similar to scipy.optimize's L-BFGS-B? If no such line search algorithm exists yet in LineSearches.jl, do you think it would be worth porting dcsrch to LineSearches.jl?
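For reference, the scipy.optimize setup from benchmark point 1 looks roughly like this (a sketch on a toy 30-dimensional Rosenbrock objective; the actual climate stress test model is in the linked repo, not reproduced here, and an explicit gradient is passed just to keep the toy run fast, whereas the real benchmark used finite differences):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(30, 0.5)  # the real problem is also 30-dimensional
res = minimize(
    rosen, x0,
    jac=rosen_der,
    method="L-BFGS-B",
    options={
        "ftol": 2.220446049250313e-09,  # default: relative reduction in f
        "gtol": 1e-5,                   # default: max |projected gradient|
        "maxls": 20,                    # default: max line-search steps
    },
)
print(res.success, res.fun)
```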

antoine-levitt commented 3 years ago

Try the backtracking line search; I've found it to be the best choice among the ones available in Optim. For LBFGS there's not much point in selecting a very fancy line search, except for added robustness at the beginning of the optimization.
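For intuition, backtracking in its simplest (Armijo) form just tries a full step and shrinks it until a sufficient-decrease condition holds; here is a minimal Python sketch of the idea (not LineSearches.jl's actual implementation):

```python
import numpy as np

def backtracking(f, x, d, g, alpha0=1.0, rho=0.5, c=1e-4, maxiter=20):
    """Armijo backtracking: shrink the step until f decreases enough.

    f: objective, x: current point, d: descent direction, g: gradient at x.
    """
    alpha, fx, slope = alpha0, f(x), c * g.dot(d)
    for _ in range(maxiter):
        if f(x + alpha * d) <= fx + alpha * slope:  # sufficient decrease
            return alpha
        alpha *= rho  # step too long: shrink and retry
    return alpha  # give up after maxiter shrinkages

# Usage on f(x) = ||x||^2 with the steepest-descent direction:
x = np.array([2.0, -1.0])
g = 2 * x
alpha = backtracking(lambda z: z.dot(z), x, -g, g)
```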

rht commented 3 years ago

I tried the backtracking line search using this code:

linesearch = LineSearches.BackTracking(iterations = 20)

1.024 s (19868 allocations: 935.33 KiB)

 * Status: success

 * Candidate solution
    Final objective value:     -6.263329e+00

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 1.41e+00 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.62e-01 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 2.2e-09
    |g(x)|                 = 1.95e-10 ≤ 1.0e-05

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    3
    f(x) calls:    811
    ∇f(x) calls:   555

It's slower than HagerZhang with linesearchmax = 20. Here is the info for HagerZhang:

 * Status: success

 * Candidate solution
    Final objective value:     -1.617721e+00

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 2.2e-09
    |g(x)|                 = 4.68e-01 ≰ 1.0e-05

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    5
    f(x) calls:    384
    ∇f(x) calls:   384
antoine-levitt commented 3 years ago

I'd check your problem and convergence criteria carefully. The number of iterations is abnormally low, and the number of function and gradient evaluations is abnormally high, which suggests the line search is struggling badly. Is your function smooth, and is your gradient correct?
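A quick way to answer "is your gradient correct?" is to compare an analytic gradient against finite differences, e.g. with scipy.optimize.check_grad (sketch on a stand-in quadratic; your model would replace f):

```python
import numpy as np
from scipy.optimize import check_grad

def f(x):
    return x.dot(x)

def grad_ok(x):
    return 2 * x        # correct gradient of x.x

def grad_buggy(x):
    return 2 * x + 0.1  # deliberately wrong, constant offset per component

x0 = np.ones(30)
print(check_grad(f, grad_ok, x0))     # tiny: gradient matches
print(check_grad(f, grad_buggy, x0))  # clearly nonzero: gradient is wrong
```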

rht commented 3 years ago

I didn't specify a gradient function, so I assume the gradient is computed by finite differences. I can't give a rigorous answer about the smoothness of the function (it's 30-dimensional), but I have tested the Python version extensively and its optimization results are satisfactory.

antoine-levitt commented 3 years ago

Maybe Optim is less optimized for the case where gradients are not explicitly available (e.g. it computes the full gradient by finite differences just to do a line search). Try with an explicit gradient.
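The cost difference is easy to see with scipy itself: with finite differences, every gradient evaluation costs roughly n extra objective evaluations (n = 30 here). A sketch on a stand-in quadratic, comparing function-call counts with and without an explicit gradient:

```python
import numpy as np
from scipy.optimize import minimize

n = 30

def f(x):
    return x.dot(x)

def g(x):
    return 2 * x

x0 = np.ones(n)
fd = minimize(f, x0, method="L-BFGS-B")         # gradient by finite differences
ex = minimize(f, x0, jac=g, method="L-BFGS-B")  # explicit gradient
print(fd.nfev, ex.nfev)  # finite differences needs far more f evaluations
```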

longemen3000 commented 3 years ago

The original paper on the Fortran algorithm cites the Moré–Thuente line search without further modification. The paper cited at the top of the library describes a modification of that code that does not touch the line search, only the calculation of the epsilon.

rht commented 3 years ago

Try with an explicit gradient

At least, I got help on Julia Discourse to modify my code so that ForwardDiff works for calculating the gradient. The code is 21.6% faster with HagerZhang(linesearchmax = 20), but still slower than the 100% Python version.

I tried MoreThuente(maxfev = 20) (and also the default maxfev), but it failed my energy-conservation assertion because the numbers became NaN. This NaN problem is not present with HagerZhang or BackTracking.

antoine-levitt commented 3 years ago

ForwardDiff is still slow; I meant with an explicitly coded (hand-derived) gradient.

rht commented 3 years ago

My model is a discrete-event simulation, so I'm not exactly sure how to derive the gradient.

Maybe Optim is less optimized for the case where gradients are not explicitly available (e.g. it computes the full gradient by finite differences just to do a line search).

What kind of improvements would need to be made to Optim.jl so that it can match scipy.optimize when both use finite differences? I don't think scipy.optimize is doing any special tricks. I think that if I used Optim.jl with the same line search that scipy.optimize.minimize(method="L-BFGS-B") uses, I would observe Optim.jl being faster.

antoine-levitt commented 3 years ago

My model is a discrete-event simulation, so I'm not exactly sure how to derive the gradient.

Usually that can be done with adjoint methods. It will likely be much more efficient than finite differences.
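As a toy illustration of the adjoint idea (a made-up linear recursion, not your simulation): for a time-stepped model x_{k+1} = a*x_k + u_k with cost J = sum_k x_k^2, a single backward pass yields dJ/du for all T controls at about the cost of one extra forward run, where finite differences would need one forward run per control:

```python
import numpy as np

a, T = 0.9, 50
u = np.linspace(0.0, 1.0, T)  # controls u_0 .. u_{T-1}

def simulate(u):
    x = np.zeros(T + 1)
    for k in range(T):
        x[k + 1] = a * x[k] + u[k]  # forward recursion
    return x

def cost(u):
    return np.sum(simulate(u) ** 2)

def adjoint_gradient(u):
    x = simulate(u)
    lam = np.zeros(T + 1)  # adjoint variables, lam[T] = 0
    grad = np.zeros(T)
    for k in reversed(range(T)):
        lam[k] = 2 * x[k + 1] + a * lam[k + 1]  # backward recursion
        grad[k] = lam[k]                        # dJ/du_k = lambda_k
    return grad

# Check against central finite differences (T forward runs per direction):
eps = 1e-6
fd = np.array([(cost(u + eps * np.eye(T)[i]) - cost(u - eps * np.eye(T)[i]))
               / (2 * eps) for i in range(T)])
print(np.max(np.abs(adjoint_gradient(u) - fd)))  # near zero: gradients agree
```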

What kind of improvements would need to be made to Optim.jl so that it can match scipy.optimize when both use finite differences? I don't think scipy.optimize is doing any special tricks. I think that if I used Optim.jl with the same line search that scipy.optimize.minimize(method="L-BFGS-B") uses, I would observe Optim.jl being faster.

I don't know whether your last sentence is true: that is why it would be useful to have a version with an explicit gradient, to see where the issue comes from. I don't think Optim's first-order optimization methods have been extensively tested or designed with gradient-free problems in mind, so it might be something stupid.