Open Lazersmoke opened 5 months ago
Minimal code to reproduce:
using Sunny, LinearAlgebra, Chairmarks
cryst = Crystal(I(3), [[0.,0,0]],1)
sys = System(cryst,(1,1,1), [SpinInfo(1;S=1/2, g=2.5)], :dipole)
set_external_field!(sys,[0,0.3,0])
int = ImplicitMidpoint(0.001; damping=0.1, atol=1e-12)
@b (for j = 1:100_000; step!(sys,int); end)
This is an interesting observation. To some extent I can reproduce the slowdown.
Here is a simpler stand-alone test we could report to Julia developers. Interestingly, it is not always the case that broadcasting is bad. In fact, sometimes it beats for loops.
using StaticArrays, Chairmarks
const Vec3 = SVector{3, Float64}
# An extreme case: Broadcasting over vectors of length 1.
Δs = randn(Vec3, 1)
s = randn(Vec3, 1)
∇E = randn(Vec3, 1)
# For loops win for this particular kernel
function f1(Δs, s, ∇E)
for iters in 1:10_000
for i in eachindex(Δs)
Δs[i] = -s[i] × (s[i] × ∇E[i])
end
end
end
function f2(Δs, s, ∇E)
for iters in 1:10_000
@. Δs = -s × (s × ∇E)
end
end
@b f1(Δs, s, ∇E) # 28.625 μs
@b f2(Δs, s, ∇E) # 34.041 μs
# Broadcasting (very slightly) wins for a simpler kernel
function g1(Δs, s, ∇E)
for iters in 1:10_000
for i in eachindex(Δs)
Δs[i] = -s[i] × ∇E[i]
end
end
end
function g2(Δs, s, ∇E)
for iters in 1:10_000
@. Δs = -s × ∇E
end
end
@b g1(Δs, s, ∇E) # 25.500 μs
@b g2(Δs, s, ∇E) # 23.958 μs
This appears to depend mainly on the number of [i]
(i.e. how many bounds checks). For our case at least, it can be very efficient to use MVector
for Δs
and SVector
for s
; I get 4x speedup on g1
in Kip's benchmark. But it means we have to get the typing information in there somewhere. I'm not sure we want to store this in the type parameters of e.g. System
, but if there's a way to convert sys.dipoles
into an MVector
before passing to rhs_dipole!
, that would be highly highly advantageous.
Might be worth also experimenting with this: https://github.com/YingboMa/FastBroadcast.jl.
Broadcasting can be (very) slow compared to the
for
loops this PR introduces, especially in cases where the size of the array is something like1x1x1x1
(requires 4 dimensions of broadcasting codegen, bounds checks, etc that doesn't need to be there). You can diagnose this is happening if you see lots of calls tobroadcast.jl, materialize!: line ###
in the@profview
profiling window originating from lines with@.
insideIntegrators.jl, #step!
.For my benchmark (not compatible with Sunny#main) I get a modest speedup (attached screenshot)