SciML / NonlinearSolve.jl

High-performance and differentiation-enabled nonlinear solvers (Newton methods), bracketed rootfinding (bisection, Falsi), with sparsity and Newton-Krylov support.
https://docs.sciml.ai/NonlinearSolve/stable/
MIT License

[WIP] reuse option in Newton Raphson method #206

Closed: frankschae closed this 9 months ago

frankschae commented 1 year ago

TODOs:

codecov[bot] commented 1 year ago

Codecov Report

Merging #206 (093490a) into master (c6c875f) will decrease coverage by 0.36%. The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #206      +/-   ##
==========================================
- Coverage   48.45%   48.10%   -0.36%     
==========================================
  Files          19       19              
  Lines        1814     1846      +32     
==========================================
+ Hits          879      888       +9     
- Misses        935      958      +23     
Files             Coverage Δ
src/raphson.jl    57.89% <50.00%> (-15.13%) :arrow_down:


frankschae commented 1 year ago

I had a discussion with @FHoltorf regarding an effective heuristic for updating the Jacobian:

ChrisRackauckas commented 1 year ago

That is precisely the thinking of Hairer II's approach. However, with a lot of benchmarking we have found that the optimal HP1 tends to be ... zero a lot of the time. So basically, just take new Jacobians when you start diverging instead of converging. That is a pretty hard heuristic to beat. In theory you could recompute earlier if convergence slows, but we haven't found a great scheme for that. So for the first version I'd keep it simple and base it on a convergence heuristic with the threshold set to zero.
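For concreteness, a minimal sketch of this "recompute only on divergence" rule (illustrative code, not the implementation in this PR; `f!` and `jac!` are hypothetical in-place residual and Jacobian callbacks):

# Sketch: reuse the Jacobian factorization as long as the residual norm keeps
# decreasing; refresh it only when the iteration starts to diverge.
using LinearAlgebra

function newton_with_reuse!(x, f!, jac!; maxiters = 100, abstol = 1e-10)
    n = length(x)
    fx, J = similar(x), zeros(n, n)
    f!(fx, x)
    jac!(J, x)                      # initial Jacobian
    F = lu(J)                       # factorization reused across iterations
    resnorm = norm(fx)
    for _ in 1:maxiters
        x .-= F \ fx
        f!(fx, x)
        newnorm = norm(fx)
        newnorm ≤ abstol && return x
        if newnorm > resnorm        # diverging -> refresh Jacobian and refactorize
            jac!(J, x)
            F = lu(J)
        end
        resnorm = newnorm
    end
    return x
end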

ChrisRackauckas commented 1 year ago

You may want to base it on the reuse heuristics of OrdinaryDiffEq.jl. Take a look at its nonlinear solvers, which document this.

frankschae commented 1 year ago

That is precisely the thinking of Hairer II's approach. However, with a lot of benchmarking we have found that the optimal HP1 tends to be ... zero a lot of the time.

Yeah, that seems to be the case for the 23 test problems as well (red line below = speed without reuse). I am wondering, though, why they all look so similar.

using NonlinearSolve, LinearAlgebra, LinearSolve, NonlinearProblemLibrary, Statistics, Test, BenchmarkTools, Plots  # Statistics provides mean

problems = NonlinearProblemLibrary.problems
dicts = NonlinearProblemLibrary.dicts

function test_on_library(problems, dicts, ϵ=1e-5)

    samples = 100
    secs = 5

    HP = [1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5]

    # store a list of all Plots
    plts = []

    alg = NewtonRaphson(; reuse=false)
    broken_tests = (1,6) # test problems where method does not converge to small residual

    for (idx, (problem, dict)) in enumerate(zip(problems, dicts))
        title = dict["title"]
        @testset "$title" begin
            # Newton without reuse
            x = dict["start"]
            res = similar(x)
            nlprob = NonlinearProblem(problem, x)
            sol = solve(nlprob, alg, abstol = 1e-18, reltol = 1e-18)
            problem(res, sol.u, nothing)
            broken = idx in broken_tests
            @test norm(res)≤ϵ broken=broken

            balg = @benchmarkable solve($nlprob, $alg, abstol=1e-18, reltol=1e-18)
            talg = run(balg, samples=samples, seconds=secs)

            # Newton with reuse
            ts = []
            for reusetol in HP
                @show reusetol
                sol = solve(nlprob, NewtonRaphson(; reuse=true, reusetol=reusetol), abstol=1e-18, reltol=1e-18)
                problem(res, sol.u, nothing)
                broken = idx in broken_tests
                @test norm(res) ≤ ϵ broken = broken

                balg2 = @benchmarkable solve($nlprob, NewtonRaphson(; reuse=true, reusetol=$reusetol), abstol=1e-18, reltol=1e-18)
                talg2 = run(balg2, samples=samples, seconds=secs)

                push!(ts, mean(talg2.times) / mean(talg.times))
            end

            pl = scatter(HP, ts, xaxis=:log, xticks=HP, label=false, xlabel="HP", ylabel="mean(time w/ reuse)/mean(time w/o reuse)", title=title)
            hline!([1.0], color = "red", label = false)
            display(pl)
            savefig(pl, "Reuse-Newton-" * title * ".png")
            push!(plts, pl)
        end
    end
    return plts
end

# NewtonRaphson
plts = test_on_library(problems, dicts)

# Merge the plots into a single plot
merged_plot = plot(plts..., layout=(5, 5), size=(2500, 2500), margin=10*Plots.mm)
savefig(merged_plot, "Reuse-Newton.png")

[Figure: Reuse-Newton.png, merged per-problem plots of mean(time w/ reuse)/mean(time w/o reuse) vs. reusetol; red line = 1.0 baseline]

So basically, just take new Jacobians when you start diverging instead of converging. That is a pretty hard heuristic to beat.

So only recomputing when the residual increases?

You may want to base it on the reuse heuristics of OrdinaryDiffEq.jl. Take a look at its nonlinear solvers, which document this.

I only scanned quickly through https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/newton.jl but didn't immediately see where the reuse happens. Is it in there?

ChrisRackauckas commented 1 year ago

So only recomputing when the residual increases?

Yup.

but didn't immediately see where the reuse happens. Is it in there?

https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/nlsolve.jl#L63-L96

FHoltorf commented 1 year ago

One heuristic (or part of one) that might be interesting to also try and incorporate is to recompute the Jacobian when the Newton direction associated with the old Jacobian stops being a descent direction for the residual norm, if for no other reason than that it would leverage the AD system nicely.

One would need to compute the gradient of the residual norm with reverse-mode AD, but even though that is more expensive than just checking the residual, it would make things play nicely with linesearch. (If you don't make any such checks, I am guessing you could run into trouble when combining Jacobian reuse with linesearch?)
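A rough sketch of that check (editorial illustration, not code from this PR; `needs_new_jacobian`, `F_old`, and the use of Zygote for the reverse-mode gradient are all assumptions):

# Refresh the Jacobian when the step computed from the stale factorization
# F_old is no longer a descent direction for φ(x) = ½‖f(x)‖².
using LinearAlgebra, Zygote

function needs_new_jacobian(f, x, F_old)
    fx = f(x)
    δx = -(F_old \ fx)                   # Newton step with the reused factorization
    gϕ = Zygote.gradient(z -> 0.5 * sum(abs2, f(z)), x)[1]  # ∇φ = Jᵀf via reverse-mode AD
    return dot(gϕ, δx) ≥ 0               # not a descent direction -> recompute J
end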

ChrisRackauckas commented 1 year ago

Ahh yes that would be an interesting approach to try.

frankschae commented 12 months ago

For the gradient of the residual norm, I think we would need to change the default norm.

https://github.com/SciML/DiffEqBase.jl/blob/master/src/common_defaults.jl#L26C1-L33C4 drops the gradient

# f, p, u0 are illustrative placeholders here (any out-of-place residual will do):
using DiffEqBase, ForwardDiff
f(u, p) = u .^ 2 .- p
p = 2.0
u0 = [1.0, 1.5]

function res_norm1(u)
    b = f(u, p)
    sqrt(sum(abs2, b) / length(u))
end

function res_norm2(u)
    b = f(u, p)
    DiffEqBase.ODE_DEFAULT_NORM(b, nothing)
end

ForwardDiff.gradient(res_norm1, u0) # nonzero gradient
ForwardDiff.gradient(res_norm2, u0) # => zero(u0): ODE_DEFAULT_NORM drops the dual partials

(Why is it actually sqrt(sum(abs2, b) / length(u)) and not sqrt(sum(abs2, b)) / length(u)?)

ChrisRackauckas commented 12 months ago

(Why is it actually sqrt(sum(abs2, b) / length(u)) and not sqrt(sum(abs2, b)) / length(u)?)

That's just matching Hairer; I don't think it really matters which one it is.
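For reference, the two candidates differ only by a constant factor in the problem size $n$: $\sqrt{\frac{1}{n}\sum_i b_i^2}$ is the RMS norm used by Hairer, while $\frac{1}{n}\sqrt{\sum_i b_i^2}$ is the 2-norm divided by $n$, so they agree up to a factor of $\sqrt{n}$ and give the same convergence behavior once tolerances are rescaled.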

https://github.com/SciML/DiffEqBase.jl/blob/master/src/common_defaults.jl#L26C1-L33C4 drops the gradient

We just shouldn't be differentiating the solvers here at all.

ChrisRackauckas commented 12 months ago

How did the AD thing come up?

frankschae commented 12 months ago

One heuristic (or part of one) that might be interesting to also try and incorporate is to recompute the Jacobian when the Newton direction associated with the old Jacobian stops being a descent direction for the residual norm, if for no other reason than that it would leverage the AD system nicely.

One would need to compute the gradient of the residual norm with reverse-mode AD, but even though that is more expensive than just checking the residual, it would make things play nicely with linesearch. (If you don't make any such checks, I am guessing you could run into trouble when combining Jacobian reuse with linesearch?)

Only for testing this idea, which might be necessary for Jacobian reuse if $\Delta x$ is not a descent direction. So I thought we'd need to compare $\Delta x$ (with the reused Jacobian) to $\frac{d\|f(x,p)\|}{dx}$.

ChrisRackauckas commented 12 months ago

Ahh yeah, the ODE one purposely avoids differentiating the norm, which isn't what you want here.

frankschae commented 10 months ago

@ChrisRackauckas: @yonatanwesen, @FHoltorf, and I were talking again about a good example today. Was there any specific ODE system for which the reuse strategy in https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/nlsolve.jl#L63-L96 was particularly good? (To have another starting point.)

ChrisRackauckas commented 10 months ago

Anything sufficiently large. Bruss with N=32 should do.
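For reference, a sketch of an N = 32 steady-state Brusselator setup along the lines of the standard SciML formulation (parameter values, forcing term, and helper names follow that example and are assumptions here, not code from this PR):

using NonlinearSolve

const N = 32
const xyd = range(0, stop = 1, length = N)
brusselator_forcing(x, y) = (((x - 0.3)^2 + (y - 0.6)^2) <= 0.1^2) * 5.0
limit(a, N) = a == N + 1 ? 1 : (a == 0 ? N : a)   # periodic boundary indices

function brusselator_2d!(du, u, p)
    A, B, alpha, dx = p
    alpha = alpha / dx^2
    for I in CartesianIndices((N, N))
        i, j = Tuple(I)
        x, y = xyd[i], xyd[j]
        ip1, im1 = limit(i + 1, N), limit(i - 1, N)
        jp1, jm1 = limit(j + 1, N), limit(j - 1, N)
        du[i, j, 1] = alpha * (u[im1, j, 1] + u[ip1, j, 1] + u[i, jp1, 1] + u[i, jm1, 1] -
                               4u[i, j, 1]) +
                      B + u[i, j, 1]^2 * u[i, j, 2] - (A + 1) * u[i, j, 1] +
                      brusselator_forcing(x, y)
        du[i, j, 2] = alpha * (u[im1, j, 2] + u[ip1, j, 2] + u[i, jp1, 2] + u[i, jm1, 2] -
                               4u[i, j, 2]) +
                      A * u[i, j, 1] - u[i, j, 1]^2 * u[i, j, 2]
    end
end

function init_brusselator(xyd)
    M = length(xyd)
    u = zeros(M, M, 2)
    for I in CartesianIndices((M, M))
        x, y = xyd[I[1]], xyd[I[2]]
        u[I, 1] = 22 * (y * (1 - y))^(3 / 2)
        u[I, 2] = 27 * (x * (1 - x))^(3 / 2)
    end
    u
end

p = (3.4, 1.0, 10.0, step(xyd))
prob = NonlinearProblem(brusselator_2d!, init_brusselator(xyd), p)
sol = solve(prob, NewtonRaphson())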

avik-pal commented 10 months ago

https://github.com/SciML/SciMLBenchmarks.jl/pull/796 has a benchmarking script. Drop the sundials and minpack versions and you should be able to test quite rapidly.

avik-pal commented 10 months ago

Sundials also reuses the Jacobian: https://sundials.readthedocs.io/en/latest/kinsol/Mathematics_link.html#jacobian-information-update-strategy

avik-pal commented 10 months ago

Though their scheme might not be ideal, given how long it takes for the Brusselator.

yonatanwesen commented 10 months ago

The results from benchmarking what we currently have on the 2D Brusselator problem don't show any significant improvement from the Jacobian reuse.

[Figure: brusselator_scaling]

ChrisRackauckas commented 10 months ago

what's the cost breakdown like?

yonatanwesen commented 10 months ago

what's the cost breakdown like?

Do you mean the cost when I profile it?

ChrisRackauckas commented 10 months ago

Yes, share some flamegraphs: https://github.com/tkluck/StatProfilerHTML.jl

yonatanwesen commented 10 months ago

Yes, share some flamegraphs: https://github.com/tkluck/StatProfilerHTML.jl

Attachments: Newton-reuse.zip, Newton-without-reuse.zip

avik-pal commented 10 months ago

Check the stats and trace to see how many times the factorization is reused.
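For example, a quick way to inspect this on a small problem (assuming a recent NonlinearSolve version where solution objects carry `stats` and `solve` accepts `show_trace`):

using NonlinearSolve

prob = NonlinearProblem((u, p) -> u .^ 2 .- p, [1.0, 2.0], 2.0)
sol = solve(prob, NewtonRaphson(); show_trace = Val(true))
# A gap between the number of Jacobians/factorizations and the number of steps
# indicates reuse.
@show sol.stats.njacs sol.stats.nfactors sol.stats.nsteps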

avik-pal commented 9 months ago

https://github.com/SciML/NonlinearSolve.jl/pull/345 supersedes this