Closed SebastianM-C closed 5 years ago
I wrote another benchmark comparing the integrator interface with the `solve` one and, assuming I didn't make any mistakes, I have shown that the problem is also present with the integrator interface. See: https://github.com/SebastianM-C/Benchmarks/blob/e808d3d6f3e049627f0fa117795fa1cf4aff1823/integ_vs_solve.ipynb
The relevant part from the above:
@btime solve($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false);
# 789.911 μs (38974 allocations: 1.21 MiB)
# 355.283 μs (17439 allocations: 463.05 KiB)
# 1.438 ms (108191 allocations: 2.76 MiB)
function integ_benchmark(prob; args...)
    integ = init(prob; args...)
    while integ.t < prob.tspan[2]
        step!(integ)
    end
end
@btime integ_benchmark($prob1, alg=Vern9(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=DPRKN12(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=KahanLi8(), dt=1e-2, maxiters=1e10)
# 897.680 μs (40428 allocations: 1.55 MiB)
# 379.758 μs (17909 allocations: 536.55 KiB)
# 1.563 ms (111196 allocations: 3.06 MiB)
tspan = (0., 100.)
prob1 = ODEProblem(ż, z0, tspan, p)
prob2 = DynamicalODEProblem(ṗ, q̇, p0, q0, tspan, p)
@btime solve($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false);
# 7.940 ms (386014 allocations: 11.80 MiB)
# 3.480 ms (173907 allocations: 4.43 MiB)
# 17.513 ms (1080083 allocations: 27.47 MiB)
@btime integ_benchmark($prob1, alg=Vern9(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=DPRKN12(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=KahanLi8(), dt=1e-2, maxiters=1e10)
# 8.980 ms (400491 allocations: 15.50 MiB)
# 3.749 ms (178553 allocations: 4.83 MiB)
# 18.589 ms (1110082 allocations: 29.61 MiB)
Note that these benchmarks were done on a different (slower) machine than the first ones.
julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
I am not sure why the benchmarks for the integrator interface give slower timings than the ones for the solve interface. I hope that the benchmark function did not introduce any (grave) performance problems.
Edit: updated the link to point to the relevant file version. I tried some modifications (see master), but I am not sure if I got it right.
I opened Julia with `--track-allocation=user --inline=no` to analyze this issue; however, the only allocation that I found was produced by the derivative function itself.
julia> @timev ż(z0, p, 1.);
0.000006 seconds (5 allocations: 208 bytes)
elapsed time (ns): 6038
bytes allocated: 208
pool allocs: 5
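Note that with `--inline=no` even an otherwise non-allocating `SVector` function will report some allocations, since the static-array operations cannot be inlined away, so a small count here is expected. Under normal inlining the same call can be checked with BenchmarkTools (a sketch, not from the original thread):

```julia
using BenchmarkTools, StaticArrays

# With inlining enabled, an out-of-place SVector derivative is expected to
# report zero allocations when the arguments are interpolated into @btime:
@btime ż($z0, $p, 1.0)
```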
If changing the timespan doesn't change the allocations in `init` when doing `save_everystep=false`, then I don't know if there's anything we can point to? We should check this.
I added another set of benchmarks with the `init` and `step!` phases separated. I am not sure whether I made any mistakes, since I get some strange results.
See https://github.com/SebastianM-C/Benchmarks/blob/694db1bef9340c12b292714691ce74d8425dac42/integ_vs_solve.ipynb
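The `step_integ` helper is not shown in this thread; a minimal sketch consistent with the calls below (my assumption, not necessarily the notebook's exact code) would be:

```julia
# Hypothetical helper: advance an already-initialized integrator until its
# time reaches tf, so that init and stepping can be benchmarked separately.
function step_integ(integ, tf)
    while integ.t < tf
        step!(integ)
    end
end
```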
With `tspan = (0., 10.)` I get
@btime init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ1, $tspan[2]) setup=(integ1=init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ2, $tspan[2]) setup=(integ2=init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false)
@btime step_integ(integ3, $tspan[2]) setup=(integ3=init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false))
# 4.335 μs (88 allocations: 18.45 KiB)
# 3.354 μs (223 allocations: 6.98 KiB)
# 4.258 μs (95 allocations: 11.13 KiB)
# 1.114 μs (69 allocations: 1.80 KiB)
# 2.825 μs (79 allocations: 6.03 KiB)
# 120.168 μs (10810 allocations: 281.53 KiB)
and when I increase it to `tspan = (0., 100.)`
@btime init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ1, $tspan[2]) setup=(integ1=init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ2, $tspan[2]) setup=(integ2=init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false)
@btime step_integ(integ3, $tspan[2]) setup=(integ3=init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false))
# 4.384 μs (86 allocations: 18.34 KiB)
# 5.858 ms (385920 allocations: 11.78 MiB)
# 4.300 μs (94 allocations: 11.03 KiB)
# 308.408 μs (19312 allocations: 502.92 KiB)
# 2.825 μs (78 allocations: 5.94 KiB)
# 14.674 ms (1080000 allocations: 27.47 MiB)
What I find strange is that in the first case the timings are suspiciously small compared with the rest of the benchmarks, while in the second case they explode for `Vern9`.
Okay, that rules out the possibility of something wrong in init. We'd need to check the stepping and the derivative function. There is a possibility that stepping has a problem like https://github.com/JuliaLang/julia/issues/22255
using OrdinaryDiffEq
using StaticArrays
using BenchmarkTools
using Profile
@inline function ż(z, p, t)
    @inbounds begin
        @assert z isa SVector
        A, B, D = p
        p₀, p₂ = z[1], z[2]
        q₀, q₂ = z[3], z[4]
        return SVector{4}(
            -A * q₀ - 3 * B / √2 * (q₂^2 - q₀^2) - D * q₀ * (q₀^2 + q₂^2),
            -q₂ * (A + 3 * √2 * B * q₀ + D * (q₀^2 + q₂^2)),
            A * p₀,
            A * p₂
        )
    end
end
q0 = SVector{2}([0.0, -4.363920590485035])
p0 = SVector{2}([10.923918825236079, -5.393598858645495])
z0 = vcat(p0, q0)
p = (A=1,B=0.55,D=0.4)
tspan = (0., 1000.)
prob1 = ODEProblem(ż, z0, tspan, p)
solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@timev solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
Profile.clear_malloc_data()
solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
exit()
When I set `tspan = (0., 100.)` I got the allocation info
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 276)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 277)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 281)
Coverage.MallocInfo(32, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 278)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 174)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 404)
Coverage.MallocInfo(96, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 378)
Coverage.MallocInfo(240, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 147)
Coverage.MallocInfo(320, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 116)
Coverage.MallocInfo(336, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 178)
Coverage.MallocInfo(400, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 6)
Coverage.MallocInfo(480, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 2)
Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)
Coverage.MallocInfo(848, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 62)
Coverage.MallocInfo(3264, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/alg_utils.jl.mem", 361)
Coverage.MallocInfo(3328, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/interp_func.jl.mem", 4)
Coverage.MallocInfo(7232, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 130)
and with `tspan = (0., 1000.)` I got
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 276)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 277)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 281)
Coverage.MallocInfo(32, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 278)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 174)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 404)
Coverage.MallocInfo(96, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 378)
Coverage.MallocInfo(240, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 147)
Coverage.MallocInfo(320, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 116)
Coverage.MallocInfo(336, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 178)
Coverage.MallocInfo(400, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 6)
Coverage.MallocInfo(480, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 2)
Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)
Coverage.MallocInfo(848, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 62)
Coverage.MallocInfo(3264, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/alg_utils.jl.mem", 361)
Coverage.MallocInfo(3328, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/interp_func.jl.mem", 4)
Coverage.MallocInfo(7232, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 130)
They are exactly identical.
So, there isn't anything extra allocated with a longer time span within the OrdinaryDiffEq.jl package. Closing.
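For anyone reproducing this: the `Coverage.MallocInfo` lists above can be generated with Coverage.jl's `analyze_malloc` after running the script under `--track-allocation=user` (a sketch; the path is taken from the listings above):

```julia
using Coverage

# Summarize the .mem files produced by --track-allocation=user; the result
# is a Vector{MallocInfo} sorted by allocated bytes per line.
mallocs = analyze_malloc(expanduser("~/.julia/dev/OrdinaryDiffEq/src"))
foreach(println, mallocs)
```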
How are the allocations in `Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)` not dependent on the number of steps?
@YingboMa, if there are allocations in `perform_step!`, how are they not dependent on the number of steps?
I started investigating this again and I observed something quite strange. For a simpler user function (lorenz) the problem disappears. MWE:
using OrdinaryDiffEq
using BenchmarkTools
using StaticArrays
function lorenz(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    return SVector{3}(du1,du2,du3)
end
@inbounds @inline function ż(z, p, t)
    A, B, D = p
    p₀, p₂ = z[SVector{2}(1:2)]
    q₀, q₂ = z[SVector{2}(3:4)]
    return SVector{4}(
        -A * q₀ - 3 * B / √2 * (q₂^2 - q₀^2) - D * q₀ * (q₀^2 + q₂^2),
        -q₂ * (A + 3 * √2 * B * q₀ + D * (q₀^2 + q₂^2)),
        A * p₀,
        A * p₂
    )
end
u0 = @SVector [1.0,0.0,0.0]
u = vcat(u0,u0)
p = (A=1,B=0.55,D=0.4)
q0 = SVector{2}([0.0, -4])
p0 = SVector{2}([10, -5])
z0 = vcat(p0, q0)
tspan1 = (0.0,10.0)
prob1_ok = ODEProblem(lorenz,u0,tspan1)
prob1_notok = ODEProblem(ż,z0,tspan1,p)
@btime solve($prob1_ok, Vern9(), save_everystep=false) # 49.999 μs (40 allocations: 9.89 KiB)
@btime solve($prob1_notok, Vern9(), save_everystep=false) # 58.199 μs (3170 allocations: 107.92 KiB)
tspan2 = (0.0,100.0)
prob2_ok = ODEProblem(lorenz,u0,tspan2)
prob2_notok = ODEProblem(ż,z0,tspan2,p)
@btime solve($prob2_ok, Vern9(), save_everystep=false) # 543.900 μs (40 allocations: 9.89 KiB)
@btime solve($prob2_notok, Vern9(), save_everystep=false) # 450.700 μs (25810 allocations: 815.42 KiB)
I tried a couple of other functions and it looks like it's somehow related to how complicated the user function is (number of operations?). The Henon system is not sufficient to trigger the problem
henon(z, p, t) = SVector(
    -z[3] * (1 + 2z[4]),
    -z[4] - (z[3]^2 - z[4]^2),
    z[1],
    z[2]
)
but if I extend lorenz or henon by writing the equations twice, I can reproduce the problem. For extending I used something like this
function extend2(f, ::SVector{M}) where M
    @inbounds function eom(u, p, t)
        idx1 = SVector{M}(1:M)
        idx2 = SVector{M}(M+1:2M)
        vcat(f(u[idx1], p, t),
             f(u[idx2], p, t))
    end
end
and I don't think it introduces problems.
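For example, using the definitions above (this usage is my assumption, not taken from the notebook):

```julia
# Doubled Henon system: z0 is an SVector{4}, so M = 4 and the
# extended state has 8 components.
henon2 = extend2(henon, z0)
u8 = vcat(z0, z0)
prob_h2 = ODEProblem(henon2, u8, (0.0, 100.0))
```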
Indeed, extending doesn't impact allocations. Using
function lorenz2(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    du4 = 10.0*(u[2+3]-u[1+3])
    du5 = u[1+3]*(28.0-u[3+3]) - u[2+3]
    du6 = u[1+3]*u[2+3] - (8/3)*u[3+3]
    return SVector{6}(du1,du2,du3,du4,du5,du6)
end
yields the same number of allocations (and reproduces the problem).
Since the `henon` system doesn't present the problem, but the one for `ż` does, I am inclined to say that the problem is not related to the number of equations (or the length of the `SVector`), but to something like the number of operations.
I tried `@inline` and `@noinline` with `lorenz2` and it doesn't influence allocations
u0 = @SVector [1.0,0.0,0.0]
u = vcat(u0,u0)
tspan1 = (0.0,10.0)
prob1_2lm = ODEProblem(lorenz2,u,tspan1)
@btime solve($prob1_2lm, Vern9(), save_everystep=false)
# @inline 49.900 μs (200 allocations: 16.69 KiB)
# @noinline 53.000 μs (200 allocations: 16.69 KiB)
tspan2 = (0.0,100.0)
prob2_2lm = ODEProblem(lorenz2,u,tspan2)
@btime solve($prob2_2lm, Vern9(), save_everystep=false)
# @inline 536.101 μs (1796 allocations: 79.03 KiB)
# @noinline 571.701 μs (1796 allocations: 79.03 KiB)
@YingboMa I updated your script above to
using StaticArrays
using Profile
using BenchmarkTools
using OrdinaryDiffEq
function lorenz(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    return SVector{3}(du1,du2,du3)
end
function lorenz2(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    du4 = 10.0*(u[2+3]-u[1+3])
    du5 = u[1+3]*(28.0-u[3+3]) - u[2+3]
    du6 = u[1+3]*u[2+3] - (8/3)*u[3+3]
    return SVector{6}(du1,du2,du3,du4,du5,du6)
end
const u0 = @SVector [1.0,0.0,0.0]
const u = vcat(u0,u0)
const tspan = (0.0,10.0)
# const prob = ODEProblem(lorenz,u0,tspan)
const prob = ODEProblem(lorenz2,u,tspan)
@time solve(prob, Tsit5(), save_everystep=false);
@time solve(prob, Tsit5(), save_everystep=false);
Profile.clear_malloc_data()
@timev solve(prob, Tsit5(), save_everystep=false);
exit()
and tried to debug with `--track-allocation=user`, but it seems quite tricky due to inlining.
I wrote a very detailed log here: https://gist.github.com/SebastianM-C/578e46a7ccff39dabd4aeeac964c099c
The linear increase of allocations is worrisome. Where are all of the other allocs? That gist only displays 144 out of roughly 700.
I used the Coverage script and checked the top 5 memory allocation spots and I think I found something. The error estimator has a linear increase in allocations.
I was confused at first that in `solve!` the `perform_step!` function showed 0 allocations, but actually in the `.mem` file you can see them.
I updated my findings here: https://github.com/SebastianM-C/Benchmarks/blob/7687fcddde85f908d238e9347b1e4c470e4ef3f1/debug_log_518.md
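As a cross-check of the `.mem` files themselves, the nonzero lines can be listed with plain Julia (a sketch; the path is illustrative):

```julia
# Walk the source tree and print every .mem line whose allocation count
# (the leading number on each line) is nonzero.
memdir = expanduser("~/.julia/dev/OrdinaryDiffEq/src")
for (root, _, files) in walkdir(memdir)
    for f in filter(x -> endswith(x, ".mem"), files)
        for (i, line) in enumerate(eachline(joinpath(root, f)))
            m = match(r"^\s*(\d+)", line)
            m !== nothing && parse(Int, m.captures[1]) > 0 &&
                println(f, ":", i, ": ", line)
        end
    end
end
```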
This contradicts @YingboMa's earlier post, so I would appreciate it if someone could try to replicate my findings, to double-check that I didn't miss something.
TLDR:
with tspan = (0.0,10.0)
- @muladd function perform_step!(integrator, cache::Tsit5ConstantCache, repeat_step=false)
0 @unpack t,dt,uprev,u,f,p = integrator
0 @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache
0 k1 = integrator.fsalfirst
0 a = dt*a21
0 k2 = f(uprev+a*k1, p, t+c1*dt)
0 k3 = f(uprev+dt*(a31*k1+a32*k2), p, t+c2*dt)
0 k4 = f(uprev+dt*(a41*k1+a42*k2+a43*k3), p, t+c3*dt)
0 k5 = f(uprev+dt*(a51*k1+a52*k2+a53*k3+a54*k4), p, t+c4*dt)
0 g6 = uprev+dt*(a61*k1+a62*k2+a63*k3+a64*k4+a65*k5)
0 k6 = f(g6, p, t+dt)
0 u = uprev+dt*(a71*k1+a72*k2+a73*k3+a74*k4+a75*k5+a76*k6)
0 integrator.fsallast = f(u, p, t+dt); k7 = integrator.fsallast
0 integrator.destats.nf += 6
- if typeof(integrator.alg) <: CompositeAlgorithm
- g7 = u
- # Hairer II, page 22
- integrator.eigen_est = integrator.opts.internalnorm(k7 - k6,t)/integrator.opts.internalnorm(g7 - g6,t)
- end
0 if integrator.opts.adaptive
0 utilde = dt*(btilde1*k1 + btilde2*k2 + btilde3*k3 + btilde4*k4 + btilde5*k5 + btilde6*k6 + btilde7*k7)
0 atmp = calculate_residuals(utilde, uprev, u, integrator.opts.abstol, integrator.opts.reltol,integrator.opts.internalnorm,t)
1648 integrator.EEst = integrator.opts.internalnorm(atmp,t)
- end
0 integrator.k[1] = k1
0 integrator.k[2] = k2
0 integrator.k[3] = k3
0 integrator.k[4] = k4
0 integrator.k[5] = k5
0 integrator.k[6] = k6
0 integrator.k[7] = k7
0 integrator.u = u
- end
with tspan = (0.0,20.0)
- @muladd function perform_step!(integrator, cache::Tsit5ConstantCache, repeat_step=false)
0 @unpack t,dt,uprev,u,f,p = integrator
0 @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache
0 k1 = integrator.fsalfirst
0 a = dt*a21
0 k2 = f(uprev+a*k1, p, t+c1*dt)
0 k3 = f(uprev+dt*(a31*k1+a32*k2), p, t+c2*dt)
0 k4 = f(uprev+dt*(a41*k1+a42*k2+a43*k3), p, t+c3*dt)
0 k5 = f(uprev+dt*(a51*k1+a52*k2+a53*k3+a54*k4), p, t+c4*dt)
0 g6 = uprev+dt*(a61*k1+a62*k2+a63*k3+a64*k4+a65*k5)
0 k6 = f(g6, p, t+dt)
0 u = uprev+dt*(a71*k1+a72*k2+a73*k3+a74*k4+a75*k5+a76*k6)
0 integrator.fsallast = f(u, p, t+dt); k7 = integrator.fsallast
0 integrator.destats.nf += 6
- if typeof(integrator.alg) <: CompositeAlgorithm
- g7 = u
- # Hairer II, page 22
- integrator.eigen_est = integrator.opts.internalnorm(k7 - k6,t)/integrator.opts.internalnorm(g7 - g6,t)
- end
0 if integrator.opts.adaptive
0 utilde = dt*(btilde1*k1 + btilde2*k2 + btilde3*k3 + btilde4*k4 + btilde5*k5 + btilde6*k6 + btilde7*k7)
0 atmp = calculate_residuals(utilde, uprev, u, integrator.opts.abstol, integrator.opts.reltol,integrator.opts.internalnorm,t)
3616 integrator.EEst = integrator.opts.internalnorm(atmp,t)
- end
0 integrator.k[1] = k1
0 integrator.k[2] = k2
0 integrator.k[3] = k3
0 integrator.k[4] = k4
0 integrator.k[5] = k5
0 integrator.k[6] = k6
0 integrator.k[7] = k7
0 integrator.u = u
- end
Looks like it's fixed by https://github.com/JuliaDiffEq/DiffEqBase.jl/pull/348
I noticed that with out-of-place integration with `SVector`s, the allocations increase with the integration time when using `save_everystep=false`. MWE:

and increasing the integration time:

I also included the timings for the full solution for comparison. (See http://nbviewer.jupyter.org/github/SebastianM-C/Benchmarks/blob/master/parallel.ipynb?flush_cache=true for more details.)