Integration performance optimization: de-broadcast dipole integration

SunnySuite / Sunny.jl

Spin dynamics and generalization to SU(N) coherent states

Other

84 stars 19 forks source link

Integration performance optimization: de-broadcast dipole integration #258

Open Lazersmoke opened 5 months ago

Lazersmoke commented 5 months ago

Broadcasting can be (very) slow compared to the for loops this PR introduces, especially in cases where the size of the array is something like 1x1x1x1 (requires 4 dimensions of broadcasting codegen, bounds checks, etc that doesn't need to be there). You can diagnose this is happening if you see lots of calls to broadcast.jl, materialize!: line ### in the @profview profiling window originating from lines with @. inside Integrators.jl, #step!.

For my benchmark (not compatible with Sunny#main) I get a modest speedup (attached screenshot)

Lazersmoke commented 5 months ago

Minimal code to reproduce:

using Sunny, LinearAlgebra, Chairmarks

cryst = Crystal(I(3), [[0.,0,0]],1)
sys = System(cryst,(1,1,1), [SpinInfo(1;S=1/2, g=2.5)], :dipole)
set_external_field!(sys,[0,0.3,0])
int = ImplicitMidpoint(0.001; damping=0.1, atol=1e-12)
@b (for j = 1:100_000; step!(sys,int); end)

kbarros commented 5 months ago

This is an interesting observation. To some extent I can reproduce the slowdown.

Here is a simpler stand-alone test we could report to Julia developers. Interestingly, it is not always the case that broadcasting is bad. In fact, sometimes it beats for loops.


using StaticArrays, Chairmarks
const Vec3 = SVector{3, Float64}

# An extreme case: Broadcasting over vectors of length 1.
Δs = randn(Vec3, 1)
s = randn(Vec3, 1)
∇E = randn(Vec3, 1)

# For loops win for this particular kernel
function f1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × (s[i] × ∇E[i])
        end
    end
end
function f2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × (s × ∇E)
    end
end
@b f1(Δs, s, ∇E) # 28.625 μs
@b f2(Δs, s, ∇E) # 34.041 μs

# Broadcasting (very slightly) wins for a simpler kernel
function g1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × ∇E[i]
        end
    end
end
function g2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × ∇E
    end
end
@b g1(Δs, s, ∇E) # 25.500 μs
@b g2(Δs, s, ∇E) # 23.958 μs

Lazersmoke commented 5 months ago

This appears to depend mainly on the number of [i] (i.e. how many bounds checks). For our case at least, it can be very efficient to use MVector for Δs and SVector for s; I get 4x speedup on g1 in Kip's benchmark. But it means we have to get the typing information in there somewhere. I'm not sure we want to store this in the type parameters of e.g. System, but if there's a way to convert sys.dipoles into an MVector before passing to rhs_dipole!, that would be highly highly advantageous.

kbarros commented 2 months ago

Might be worth also experimenting with this: https://github.com/YingboMa/FastBroadcast.jl.