ModelingToolkit slower than native OrdinaryDiffEq?

oameye commented 8 months ago

Question❓ When doing a basic benchmark for a Meier-Stein model, I find that giving the bare function to OrdinaryDiffEq is faster than doing it through ModelingToolkit. However, should the ModelingToolkit version not be more performant as it has acces to the Jacobian? Is the Symbolic layer of ModelingToolkit a bottleneck for the solver? Thank you for all the help you can give me to understand the result.

using ModelingToolkit, OrdinaryDiffEq, BenchmarkTools, StaticArrays

@variables t u(t) v(t)
D = Differential(t)

eqs = [D(u) ~ u - u^3 - 10*u*v^2,
       D(v) ~ -(1 + u^2)*v]

@named sys = ODESystem(eqs)
sys = structural_simplify(sys)

u0 = [u => 0.2,
      v => 0.3]

function meier_stein!(dx, x, p, t) # in-place
    u, v = x
    dx[1] = u - u^3 - 10*u*v^2
    dx[2] = -(1+u^2)*v
end
function meier_stein(u, p, t) # out-of-place
    u, v = x
    du =u - u^3 - 10*u*v^2,
    dv = -(1 + u^2)*v
    SA[du, dv]
end

tspan = (0.0, 10_000.0)
prob_od = ODEProblem{false}(meier_stein, SVector{2}(getindex.(u0,2)), tspan, ())
prob_od! = ODEProblem{true}(meier_stein!, getindex.(u0,2), tspan, ())
prob_mt_jac = ODEProblem{false}(sys, u0, tspan, []; jac = true, u0_constructor = x -> SVector(x...))
prob_mt_jac! = ODEProblem{true}(sys, u0, tspan, []; jac = true)
prob_mt = ODEProblem{false}(sys, u0, tspan, []; jac = false, u0_constructor = x -> SVector(x...))
prob_mt! = ODEProblem{true}(sys, u0, tspan, []; jac = false)

@btime solve($prob_od, Tsit5(), saveat=5); # 694.000 μs (25 allocations: 15.58 KiB)
@btime solve($prob_od!, Tsit5(), saveat=5); # 2.176 ms (3760 allocations: 86.14 KiB)
@btime solve($prob_mt_jac, Tsit5(), saveat=5); # 1.287 ms (27 allocations: 119.72 KiB)
@btime solve($prob_mt_jac!, Tsit5(), saveat=5); # 1.688 ms (2045 allocations: 261.81 KiB)
@btime solve($prob_mt, Tsit5(), saveat=5); # 1.316 ms (27 allocations: 119.72 KiB)
@btime solve($prob_mt!, Tsit5(), saveat=5); # 1.983 ms (2055 allocations: 262.78 KiB)

oameye commented 8 months ago

It depends heavily on the system. For a Kerr Parametric Oscillator (KPO), ModelingToolkit is faster again. Likely because OrdinaryDiffEq does some optimization for the Maier-Stein model, something it cannot seem to do for a KPO.

using ModelingToolkit, OrdinaryDiffEq, BenchmarkTools, StaticArrays

γ, α, ω₀, ω, λ = 0.01, 1.0, 1.0, 1.0, 0.6
@variables t u(t) v(t)
D = Differential(t)

eqs = [D(u) ~ -γ * u / 2 - (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 + λ / 4) * v / ω,
    D(v) ~ -γ * v / 2 + (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 - λ / 4) * u / ω]

@named sys = ODESystem(eqs)
sys = structural_simplify(sys)

u0 = [u => 0.2,
    v => 0.3]

function KPO!(dx, x, p, t) # in-place
    u, v = x
    dx[1] = -γ * u / 2 - (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 + λ / 4) * v / ω
    dx[2] = -γ * v / 2 + (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 - λ / 4) * u / ω
end
function KPO(x, p, t) # out-of-place
    u, v = x
    du = -γ * u / 2 - (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 + λ / 4) * v / ω
    dv = -γ * v / 2 + (3 * α * (u^2 + v^2) / 8 + (ω₀^2 - ω^2) / 2 - λ / 4) * u / ω
    SA[du, dv]
end

tspan = (0.0, 10_000.0)
prob_od = ODEProblem{false}(KPO, SVector{2}(getindex.(u0, 2)), tspan, ())
prob_od! = ODEProblem{true}(KPO!, getindex.(u0, 2), tspan, ())
prob_mt_jac = ODEProblem{false}(sys, u0, tspan, []; jac=true, u0_constructor=x -> SVector(x...))
prob_mt_jac! = ODEProblem{true}(sys, u0, tspan, []; jac=true)
prob_mt = ODEProblem{false}(sys, u0, tspan, []; jac=false, u0_constructor=x -> SVector(x...))
prob_mt! = ODEProblem{true}(sys, u0, tspan, []; jac=false)

@btime solve($prob_od, Tsit5(), saveat=5); # 14.263 ms (616018 allocations: 12.20 MiB)
@btime solve($prob_od!, Tsit5(), saveat=5); # 6.766 ms (398880 allocations: 6.31 MiB)
@btime solve($prob_mt_jac, Tsit5(), saveat=5); # 595.200 μs (27 allocations: 119.72 KiB)
@btime solve($prob_mt_jac!, Tsit5(), saveat=5); # 694.200 μs (2045 allocations: 261.81 KiB)
@btime solve($prob_mt, Tsit5(), saveat=5); # 598.000 μs (27 allocations: 119.72 KiB)
@btime solve($prob_mt!, Tsit5(), saveat=5); # 766.100 μs (2055 allocations: 262.78 KiB)

ChrisRackauckas commented 8 months ago

Symbolics doesn't specialize the codegen of StaticArrays appropriately. This is something I've surfaced a few times but it wasn't fixed yet. @AayushSabharwal can you have a look when you can? Effectively, you want to not do any of the extra conversions that it currently does, the compiler does not elide them.

ChrisRackauckas commented 5 months ago

@AayushSabharwal bump on this, looking at static array codegen.

AayushSabharwal commented 5 months ago

I've found something weird. This is the out-of-place function from prob_mt_jac:

julia> prob_mt_jac.f.f.f_oop
RuntimeGeneratedFunction(#=in ModelingToolkit=#, #=using ModelingToolkit=#, :((ˍ₋arg1, t)->begin
          #= /Users/aayush/.julia/packages/SymbolicUtils/0opve/src/code.jl:373 =#
          #= /Users/aayush/.julia/packages/SymbolicUtils/0opve/src/code.jl:374 =#
          #= /Users/aayush/.julia/packages/SymbolicUtils/0opve/src/code.jl:375 =#
          begin
              begin
                  begin
                      #= /Users/aayush/.julia/packages/SymbolicUtils/0opve/src/code.jl:468 =#
                      (SymbolicUtils.Code.create_array)(typeof(ˍ₋arg1), nothing, Val{1}(), Val{(2,)}(), (+)((+)(ˍ₋arg1[1], (*)(-1, (^)(ˍ₋arg1[1], 3))), (*)((*)(-10, ˍ₋arg1[1]), (^)(ˍ₋arg1[2], 2))), (*)((+)(-1, (*)(-1, (^)(ˍ₋arg1[1], 2))), ˍ₋arg1[2]))
                  end
              end
          end
      end))

If I copy this expression and benchmark it (replacing arg1 with x)

julia> bar = (x, t)->begin
                 (SymbolicUtils.Code.create_array)(typeof(x), nothing, Val{1}(), Val{(2,)}(), (+)((+)(x[1], (*)(-1, (^)(x[1], 3))), (*)((*)(-10, x[1]), (^)(x[2], 2))), (*)((+)(-1, (*)(-1, (^)(x[1], 2))), x[2]))
               end
julia> @be (su, 0.0) sum(bar(_...))
Benchmark: 2579 samples with 343 evaluations
min    38.749 ns (4.02 allocs: 96.560 bytes)
median 65.720 ns (4.02 allocs: 96.560 bytes)
mean   100.692 ns (4.02 allocs: 96.560 bytes, 0.04% gc time)
max    64.061 μs (4.02 allocs: 96.560 bytes, 99.77% gc time)

Which allocates. meier_stein doesn't:

julia> @be (su, nothing, 0.0) sum(meier_stein(_...))
Benchmark: 4754 samples with 7305 evaluations
min    1.865 ns
median 2.453 ns
mean   2.636 ns
max    16.524 ns

However, if I put this expression into a function instead of a closure and run the benchmark

julia> function foo(x, t)
           (SymbolicUtils.Code.create_array)(typeof(x), nothing, Val{1}(), Val{(2,)}(), (+)((+)(x[1], (*)(-1, (^)(x[1], 3))), (*)((*)(-10, x[1]), (^)(x[2], 2))), (*)((+)(-1, (*)(-1, (^)(x[1], 2))), x[2]))
       end
julia> @be (su, 0.0) sum(foo(_...))
Benchmark: 4732 samples with 8806 evaluations
min    1.864 ns
median 1.878 ns
mean   2.203 ns
max    10.746 ns

It is just as fast as meier_stein.

So far, this is the only explanation I've found for this behavior. Even reducing bar to something practically identical to meier_stein suffers slowdowns, but foo doesn't:

julia> bar = (₋arg1, t)->begin
                           u, v = ₋arg1
                           du = (+)(u, - (u ^ 3), (*)(-10, u, v ^ 2));
                           dv= (*)(-1, (+)(1, (^)(u, 2)), v);
                           return SA[du, dv]
             end
julia> @be (su, 0.0) sum(bar(_...))
Benchmark: 2581 samples with 250 evaluations
min    55.500 ns (4.03 allocs: 96.768 bytes)
median 91.332 ns (4.03 allocs: 96.768 bytes)
mean   126.129 ns (4.03 allocs: 96.768 bytes)
max    17.554 μs (4.03 allocs: 96.768 bytes)

For reference, the @code_lowered outputs for all three:

julia> @code_lowered foo(su, nothing, 0.0)
CodeInfo(
1 ─ %1  = Base.getproperty(Main.SymbolicUtils, :Code)
│   %2  = Base.getproperty(%1, :create_array)
│   %3  = Main.typeof(x)
│   %4  = Main.nothing
│   %5  = Core.apply_type(Main.Val, 1)
│   %6  = (%5)()
│   %7  = Main.Val
│   %8  = Core.tuple(2)
│   %9  = Core.apply_type(%7, %8)
│   %10 = (%9)()
│   %11 = Base.getindex(x, 1)
│   %12 = Main.:^
│   %13 = Base.getindex(x, 1)
│   %14 = Core.apply_type(Base.Val, 3)
│   %15 = (%14)()
│   %16 = Base.literal_pow(%12, %13, %15)
│   %17 = -1 * %16
│   %18 = %11 + %17
│   %19 = Base.getindex(x, 1)
│   %20 = -10 * %19
│   %21 = Main.:^
│   %22 = Base.getindex(x, 2)
│   %23 = Core.apply_type(Base.Val, 2)
│   %24 = (%23)()
│   %25 = Base.literal_pow(%21, %22, %24)
│   %26 = %20 * %25
│   %27 = %18 + %26
│   %28 = Main.:^
│   %29 = Base.getindex(x, 1)
│   %30 = Core.apply_type(Base.Val, 2)
│   %31 = (%30)()
│   %32 = Base.literal_pow(%28, %29, %31)
│   %33 = -1 * %32
│   %34 = -1 + %33
│   %35 = Base.getindex(x, 2)
│   %36 = %34 * %35
│   %37 = (%2)(%3, %4, %6, %10, %27, %36)
└──       return %37
)

julia> @code_lowered bar(su, 0.0)
CodeInfo(
1 ─ %1  = Base.indexed_iterate(₋arg1, 1)
│         u = Core.getfield(%1, 1)
│         @_4 = Core.getfield(%1, 2)
│   %4  = Base.indexed_iterate(₋arg1, 2, @_4)
│         v = Core.getfield(%4, 1)
│   %6  = u
│   %7  = Main.:^
│   %8  = u
│   %9  = Core.apply_type(Base.Val, 3)
│   %10 = (%9)()
│   %11 = Base.literal_pow(%7, %8, %10)
│   %12 = -%11
│   %13 = u
│   %14 = Main.:^
│   %15 = v
│   %16 = Core.apply_type(Base.Val, 2)
│   %17 = (%16)()
│   %18 = Base.literal_pow(%14, %15, %17)
│   %19 = -10 * %13 * %18
│         du = %6 + %12 + %19
│   %21 = Main.:^
│   %22 = u
│   %23 = Core.apply_type(Base.Val, 2)
│   %24 = (%23)()
│   %25 = Base.literal_pow(%21, %22, %24)
│   %26 = 1 + %25
│         dv = -1 * %26 * v
│   %28 = Base.getindex(Main.SA, du, dv)
└──       return %28

julia> @code_lowered meier_stein(su, nothing, 0.0)
CodeInfo(
1 ─ %1  = Base.indexed_iterate(x, 1)
│         u = Core.getfield(%1, 1)
│         @_5 = Core.getfield(%1, 2)
│   %4  = Base.indexed_iterate(x, 2, @_5)
│         v = Core.getfield(%4, 1)
│   %6  = u
│   %7  = Main.:^
│   %8  = u
│   %9  = Core.apply_type(Base.Val, 3)
│   %10 = (%9)()
│   %11 = Base.literal_pow(%7, %8, %10)
│   %12 = %6 - %11
│   %13 = u
│   %14 = Main.:^
│   %15 = v
│   %16 = Core.apply_type(Base.Val, 2)
│   %17 = (%16)()
│   %18 = Base.literal_pow(%14, %15, %17)
│   %19 = 10 * %13 * %18
│         du = %12 - %19
│   %21 = Main.:^
│   %22 = u
│   %23 = Core.apply_type(Base.Val, 2)
│   %24 = (%23)()
│   %25 = Base.literal_pow(%21, %22, %24)
│   %26 = 1 + %25
│   %27 = -%26
│         dv = %27 * v
│   %29 = Base.getindex(Main.SA, du, dv)
└──       return %29
)

They are practically identical

AayushSabharwal commented 5 months ago

Even weirder:


julia> expr = :(function foo(x, t)
                  (SymbolicUtils.Code.create_array)(typeof(x), nothing, Val{1}(), Val{(2,)}(), (+)((+)(x[1], (*)(-1, (^)(x[1], 3))), (*)((*)(-10, x[1]), (^)(x[2], 2))), (*)((+)(-1, (*)(-1, (^)(x[1], 2))), x[2]))
              end)

julia> @be (su, nothing, 0.0) sum(meier_stein(_...))
Benchmark: 4335 samples with 8773 evaluations
min    1.867 ns
median 2.341 ns
mean   2.430 ns
max    9.456 ns

julia> fn = @RuntimeGeneratedFunction(expr)
julia> @be (su, 0.0) fn(_...)
Benchmark: 2525 samples with 473 evaluations
min    30.302 ns (3.02 allocs: 80.474 bytes)
median 50.651 ns (3.02 allocs: 80.474 bytes)
mean   75.703 ns (3.02 allocs: 80.474 bytes)
max    13.093 μs (3.02 allocs: 80.474 bytes)

julia> @be (@RuntimeGeneratedFunction(expr), su, 0.0) _[1](_[2], _[3])
Benchmark: 767 samples with 8699 evaluations
min    1.868 ns
median 2.381 ns
mean   2.513 ns
max    12.061 ns

SciML / ModelingToolkit.jl

ModelingToolkit slower than native OrdinaryDiffEq? #2474