Open rht opened 3 years ago
Try backtracking linesearch. I've found it to be the best choice among the available ones in Optim. For LBFGS there's not much point in selecting a very fancy linesearch, except for added robustness at the beginning of the optimization.
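For context, the backtracking (Armijo) strategy referred to here can be sketched as follows. This is an illustrative sketch in Python, not the `LineSearches.BackTracking` implementation; its appeal is that each trial step costs only one new objective evaluation, with no extra gradient calls:

```python
import numpy as np

def backtracking(f, x, g, p, alpha0=1.0, c=1e-4, rho=0.5, max_iter=20):
    """Armijo backtracking: shrink the step until sufficient decrease holds.

    f : objective, R^n -> R
    x : current point
    g : gradient at x
    p : descent direction (e.g. the L-BFGS direction)
    """
    fx = f(x)
    slope = np.dot(g, p)          # directional derivative; must be < 0
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c * alpha * slope:
            return alpha          # sufficient-decrease (Armijo) condition met
        alpha *= rho              # otherwise shrink the step
    return alpha

# usage: quadratic f(x) = x'x, steepest-descent direction
f = lambda x: float(np.dot(x, x))
x = np.array([1.0, 1.0])
g = 2 * x
alpha = backtracking(f, x, g, -g)  # accepts alpha = 0.5 after one shrink
```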
I tried the backtracking linesearch using this code:
linesearch = LineSearches.BackTracking(iterations = 20)
1.024 s (19868 allocations: 935.33 KiB)
* Status: success
* Candidate solution
Final objective value: -6.263329e+00
* Found with
Algorithm: Fminbox with L-BFGS
* Convergence measures
|x - x'| = 1.41e+00 ≰ 0.0e+00
|x - x'|/|x'| = 2.62e-01 ≰ 0.0e+00
|f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 2.2e-09
|g(x)| = 1.95e-10 ≤ 1.0e-05
* Work counters
Seconds run: 3 (vs limit Inf)
Iterations: 3
f(x) calls: 811
∇f(x) calls: 555
It's slower than HagerZhang with maxlinesearch of 20. Here is the info for HagerZhang:
* Status: success
* Candidate solution
Final objective value: -1.617721e+00
* Found with
Algorithm: Fminbox with L-BFGS
* Convergence measures
|x - x'| = 0.00e+00 ≤ 0.0e+00
|x - x'|/|x'| = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 2.2e-09
|g(x)| = 4.68e-01 ≰ 1.0e-05
* Work counters
Seconds run: 3 (vs limit Inf)
Iterations: 5
f(x) calls: 384
∇f(x) calls: 384
I'd check carefully your problem and convergence criteria. The number of iterations is abnormally low, and the number of function and gradient evaluations is abnormally high, suggesting that the line search is very much struggling. Is your function smooth, and is your gradient correct?
I didn't specify any gradient function, so I assume it's finite difference. I can't give a rigorous answer about the smoothness of the function (it's a 30-dimensional function), but I have extensively tested the Python version and its optimization result is satisfactory.
Maybe Optim is less optimized for the case when gradients are not explicitly available (e.g. it computes the full gradient by finite differences just to do a linesearch). Try with an explicit gradient.
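To see why finite differences inside a line search are expensive, a quick count helps (an illustrative Python sketch, not Optim's actual implementation): a forward-difference gradient costs n + 1 objective evaluations, so for the 30-dimensional model in this thread every gradient the line search requests is worth about 31 objective calls.

```python
import numpy as np

calls = 0  # count objective evaluations

def f(x):
    global calls
    calls += 1
    return float(np.sum(x ** 2))

def fd_gradient(f, x, h=1e-8):
    """Forward-difference gradient: one base call plus one call per coordinate."""
    fx = f(x)
    g = np.empty_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        g[i] = (f(xp) - fx) / h
    return g

x = np.zeros(30)
fd_gradient(f, x)
print(calls)  # 31 = n + 1
```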
From the original paper of the Fortran algorithm, they cite the MoreThuente line search without further modifications. The paper cited at the start of the library describes a modification of that code which does not touch the line search, only the calculation of the eps.
Try with an explicit gradient
At least, I have got help on Julia Discourse to modify my code so that ForwardDiff works for calculating the gradient. The code is 21.6% faster with HagerZhang(maxlinesearch = 20), but still slower than the 100% Python version.
I tried MoreThuente(maxfev = 20) (and also the default maxfev), but it failed my energy-conservation assertion because the numbers became NaN. This NaN problem is not present in HagerZhang nor BackTracking.
ForwardDiff can still be slow; I meant with an explicitly coded (hand-derived) gradient.
My model is a discrete-event simulation, so I'm not exactly sure how to derive the gradient.
Maybe Optim is less optimized for when gradients are not explicitly available (eg computes the full gradient by finite differences just to do a linesearch).
What kind of improvements need to be made to Optim.jl so that it can match scipy.optimize when both use finite differences? I don't think scipy.optimize is doing any special tricks. I think if I use Optim.jl with the same line search that scipy.optimize.minimize(method="L-BFGS-B") uses, I will observe that Optim.jl is faster.
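One way to make the scipy side of the comparison concrete is to run its L-BFGS-B with and without an explicit gradient and compare the evaluation counts in the result object. The Rosenbrock function below is a stand-in for the model in this thread, not the actual benchmark:

```python
import numpy as np
from scipy.optimize import minimize

# 30-dimensional Rosenbrock as a stand-in objective.
def rosen(x):
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2))

x0 = np.full(30, 0.5)

# No jac given: scipy approximates the gradient by finite differences,
# and every such gradient is charged to nfev.
res_fd = minimize(rosen, x0, method="L-BFGS-B")

# Explicit gradient: same optimizer, far fewer objective evaluations.
def rosen_grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return g

res_an = minimize(rosen, x0, jac=rosen_grad, method="L-BFGS-B")
print(res_fd.nfev, res_an.nfev)  # finite differences cost ~n extra calls per gradient
```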
My model is a discrete-event simulation, so I'm not exactly sure how to derive the gradient.
Usually that can be done with adjoint methods. It will likely be much more efficient than finite differences.
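The adjoint idea for a time-stepped simulation can be sketched as follows (a toy Python example with a made-up one-parameter step function, not the discrete-event model from this thread): one forward pass stores the trajectory, one backward pass accumulates the gradient, and the cost is independent of the number of parameters, unlike finite differences.

```python
import numpy as np

# Simulation x_{t+1} = step(x_t, theta), loss = L(x_T).
def step(x, theta):
    return x + theta * np.sin(x)          # toy "simulation" step

def loss(xT):
    return 0.5 * float(xT ** 2)

def grad_theta(x0, theta, T):
    xs = [x0]
    for _ in range(T):                    # forward pass: store the trajectory
        xs.append(step(xs[-1], theta))
    lam = xs[-1]                          # adjoint seed: dL/dx_T for L = x_T^2 / 2
    g = 0.0
    for t in range(T - 1, -1, -1):        # backward pass: adjoint recursion
        g += lam * np.sin(xs[t])          # d step / d theta = sin(x_t)
        lam *= 1.0 + theta * np.cos(xs[t])  # d step / d x = 1 + theta cos(x_t)
    return g

# sanity check against a central finite difference
x0, theta, T, h = 0.3, 0.1, 50, 1e-6

def run(th):
    x = x0
    for _ in range(T):
        x = step(x, th)
    return loss(x)

adj = grad_theta(x0, theta, T)
fd = (run(theta + h) - run(theta - h)) / (2 * h)
print(abs(adj - fd))  # small: the adjoint gradient matches finite differences
```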
What kind of improvements need to be made to Optim.jl so that it can match scipy.optimize when both use finite differences? I don't think scipy.optimize is doing any special tricks. I think if I use Optim.jl with the same line search that scipy.optimize.minimize(method="L-BFGS-B") uses, I will observe that Optim.jl is faster.
I don't know whether your last sentence is true or not: this is why it would be useful to have a version with explicit gradient to be able to see where the issue comes from. I don't think Optim's first-order optimization methods have been extensively tested/designed with problems without explicit gradient in mind, so it might be something stupid.
I'm @**Rein Zustand** on julialang.zulipchat.com, who asked a question there on this topic. See @pkofod's answer at https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-.28published.29/topic/why.20is.20Optim.2Ejl.20so.20slow.3F/near/227892581. The difference is that this time I have got permission to publish a subset of my model on GitHub for concrete context: https://github.com/rht/climate_stress_test_benchmark (the code is ~300 LOC in Julia / Python). I have also asked on Julia Discourse for insights (see https://discourse.julialang.org/t/optim-jl-vs-scipy-optimize-once-again/61661).
The benchmark result:
@btime
The remaining difference between Optim.jl and scipy.optimize seems to be that Optim.jl uses the HagerZhang line search, which is probably inherently slower than scipy.optimize's dcsrch, which is extracted from Minpack2 with some modifications (function definition of the line search algorithm at https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L3326). Some comments (https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L2439-L2480):
Which line search algorithm should I use from LineSearches.jl, and with what parameters, so that it is most similar to scipy.optimize's L-BFGS-B? If such a line search algorithm does not yet exist in LineSearches.jl, do you think it is worth porting dcsrch to LineSearches.jl?
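For comparing the two, it may help that both dcsrch (MoreThuente-style) and HagerZhang aim for a step satisfying the strong Wolfe conditions; they differ mainly in how they bracket and refine it. A small checker sketches what both are targeting (an illustrative Python sketch with hypothetical helper `strong_wolfe`, not code from either library):

```python
import numpy as np

def strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check whether step length alpha satisfies the strong Wolfe conditions."""
    phi0 = f(x)
    dphi0 = float(np.dot(grad(x), p))      # must be negative (descent direction)
    phia = f(x + alpha * p)
    dphia = float(np.dot(grad(x + alpha * p), p))
    sufficient_decrease = phia <= phi0 + c1 * alpha * dphi0
    curvature = abs(dphia) <= c2 * abs(dphi0)
    return sufficient_decrease and curvature

# usage on a quadratic: the exact minimizer along -grad satisfies both conditions,
# while a vanishingly small step fails the curvature condition
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad(x)
print(strong_wolfe(f, grad, x, p, alpha=0.5))   # exact line minimum -> True
print(strong_wolfe(f, grad, x, p, alpha=1e-8))  # tiny step -> False
```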