carstenbauer opened this issue 3 years ago
Since I know that you're interested in these things, maybe a bit of background. A C++ friend gave me this code (.txt because GitHub doesn't allow .cpp):
and told me that his Julia code was much slower (>5x) (again .txt because .jl isn't allowed):
I went ahead and tried to reproduce the timings (compiling with icc -O3 -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D) and got an entirely different picture:
➜ ~/Desktop/ising julia 2dising.jl
-0.4428173828125
-0.4118935546875
-0.436130859375
-0.460140625
-0.4394482421875
-0.4422109375
-0.4049609375
3.088 s (14 allocations: 822.06 KiB)
➜ ~/Desktop/ising time ./Ising2D
-0.423299
./Ising2D 9.22s user 0.01s system 99% cpu 9.234 total
I asked him what he's doing differently and he told me that he is using icpx instead of icc or clang. For him, this makes an interestingly HUGE difference:
cecri@cecri-Mint:~/work/Practices/2dIsing$ clang++-9 -O3 -march=native Ising2D.cpp
cecri@cecri-Mint:~/work/Practices/2dIsing$ time ./a.out
-0.434271
real 0m3.337s
user 0m3.330s
sys 0m0.004s
cecri@cecri-Mint:~/work/Practices/2dIsing$ icpx -Ofast -xHost Ising2D.cpp
cecri@cecri-Mint:~/work/Practices/2dIsing$ time ./a.out
-0.406395
real 0m0.667s
user 0m0.663s
sys 0m0.004s
So, while Julia is competitive with clang/icc (perhaps even faster), icpx crushes everyone else! My thought then was: alright, I know a package that is known for automagical performance speedups, which is why I tried LoopVectorization.jl after optimising the code a bit (40% speedup on my machine):
using Random
using BenchmarkTools

const rng = MersenneTwister()

function get(spin_conf, x, y)
    (Nx, Ny) = size(spin_conf)
    @inbounds spin_conf[(x-1+Nx)%Nx + 1, (y-1+Ny)%Ny + 1]
end

function deltaE(spin_conf, i, j)
    return 2.0*get(spin_conf,i,j) * (get(spin_conf,i-1,j) + get(spin_conf,i+1,j) +
                                     get(spin_conf,i,j+1) + get(spin_conf,i,j-1))
end

function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for i = 1:Nx
        @simd for j = 1:Ny
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end

magnetism(spin_conf) = sum(spin_conf)

function sweep!(rec, spin_conf, beta, M)
    (Nx, Ny) = size(spin_conf)
    @inbounds @simd for i = 1:M
        x = rand(rng, 1:Nx)
        y = rand(rng, 1:Ny)
        r = rand(rng)
        dE = deltaE(spin_conf, x, y)
        if exp(-beta*dE) > r
            spin_conf[x,y] *= -1
        end
        rec[i] = energy(spin_conf)
    end
    return rec
end

function main()
    beta = 0.2
    L = 64
    total_iteration = 100_000
    spin_conf = rand(rng, (-1,1), L, L)
    rec = Vector{Int64}(undef, total_iteration)
    sweep!(rec, spin_conf, beta, total_iteration)
    s = sum(last(rec, 1000))/1000
    s /= L^2
    println(s)
    return nothing
end

@btime main()
This is the background for this issue, and any help in getting the magical performance of icpx in Julia would be appreciated :)
(Update: g++ can also give the "magic" when run with the -fwhole-program compiler flag.)
I realised that when I use the somewhat outdated @unroll macro from https://github.com/StephenVavasis/Unroll.jl/blob/master/src/Unroll.jl and hardcode the loop length (i.e. replace Ny by 64) like so
function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    for i = 1:Nx
        @unroll for j = 1:64
            res += -get(spin_conf,i,j)*(get(spin_conf,i+1,j) + get(spin_conf,i,j+1))
        end
    end
    return res
end
I get similar "magical" performance in Julia:
julia> @btime main();
459.902 ms (5 allocations: 821.34 KiB)
While this is great, I had to use a macro from an unmaintained package and hardcode the iterator length. LoopVectorization would hopefully make this experience nicer. :)
C++ code compiled with g++ and -fwhole-program for comparison:
➜ ~/Desktop/ising cat compile_gcc.sh
g++-10 -O3 -march=native -funroll-loops -fwhole-program -I ~/.local/lib/eigen/ Ising2D.cpp -o Ising2D
➜ ~/Desktop/ising time ./Ising2D
-0.370169
./Ising2D 0.43s user 0.00s system 99% cpu 0.430 total
(Update: Unfortunately, my friend told me that if he does the same and puts a #pragma GCC unroll 64 in front of the loop, he gets
cecri@cecri-Mint:~/work/Practices/2dIsing$ g++ -std=gnu++17 -O3 -march=native -ffast-math -fwhole-program Ising2D.cpp
cecri@cecri-Mint:~/work/Practices/2dIsing$ time ./a.out
-0.410533
real 0m0.179s
user 0m0.176s
sys 0m0.004s
So there is still more room for optimisation on the Julia side...)
Regarding the vrem_fast error: that's a VectorizationBase bug.
julia> vxi = Vec(ntuple(Int,VectorizationBase.pick_vector_width(Int))...)
Vec{8, Int64}<1, 2, 3, 4, 5, 6, 7, 8>
julia> vxi2 = Vec(ntuple(_ -> rand(Int),VectorizationBase.pick_vector_width(Int))...)
Vec{8, Int64}<3175761247964132865, -7335689000808951881, 577807031437413038, -6325572139043877891, -55170765264911246, -6016006685313364129, -5181912736038640374, 6721328960625882679>
julia> vxi2 % vxi
Vec{8, Int64}<0, -1, 2, -3, -1, -1, -3, 7>
julia> @fastmath vxi2 % vxi
Vec{8, Int64}<0, -1, 2, -3, -1, -1, -3, 7>
julia> VectorizationBase.vrem_fast(vxi2, vxi)
ERROR: MethodError: no method matching vrem_fast(::Vec{8, Int64}, ::Vec{8, Int64})
Stacktrace:
[1] top-level scope
@ REPL[8]:1
julia> VectorizationBase.vrem(vxi2, vxi)
Vec{8, Int64}<0, -1, 2, -3, -1, -1, -3, 7>
Something's going wrong in the dispatch pipeline, but an obvious fix is to just define the fallback method in VectorizationBase:
@inline vrem_fast(a,b) = vrem(a,b)
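Until that fix lands in VectorizationBase itself, a minimal sketch of the same workaround applied from user code (note that this is type piracy on VectorizationBase internals, so treat it purely as a local experiment):

import VectorizationBase
# Hypothetical user-side patch: forward the @fastmath remainder to the plain
# vrem fallback, so `%` inside vectorized code stops hitting the MethodError
# shown in the REPL session above.
@inline VectorizationBase.vrem_fast(a, b) = VectorizationBase.vrem(a, b)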
The indexing issue would be a little trickier, but I should define getindex and setindex! methods for some AbstractArray types (those that support stridedpointer).
I'll try and take a look at everything else later. Definitely a fan of winning benchmarks. =)
However, I don't think LoopVectorization will do the right thing here, and it may be that gcc is more clever.
function energy(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    @avx for i = 1:Nx
        for j = 1:Ny
            i0 = (i-1+Nx)%Nx + 1
            i1 = (i+Nx)%Nx + 1
            j0 = (j-1+Ny)%Ny + 1
            j1 = (j+Ny)%Ny + 1
            res += -spin_conf[i0,j0]*(spin_conf[i1,j0] + spin_conf[i0,j1])
        end
    end
    return res
end
Basically, you need to get rid of the %Nx and %Ny by breaking up the iteration space. I'm guessing this is what gcc and icpx do, but it is not an optimization I've implemented in LoopVectorization yet.
The splitting should be really straightforward to do manually.
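For instance, a hand-split version of energy along those lines might look like the following (my own sketch of the manual splitting, with a hypothetical name energy_split; it is not something LoopVectorization generates). The bulk of the lattice needs no % at all, and only the wrap-around row, column, and corner touch the periodic boundary.

function energy_split(spin_conf)
    (Nx, Ny) = size(spin_conf)
    res = 0
    # bulk: the i+1 and j+1 neighbours are always in bounds, so no modulo is needed
    @inbounds for j = 1:Ny-1, i = 1:Nx-1
        res -= spin_conf[i, j] * (spin_conf[i+1, j] + spin_conf[i, j+1])
    end
    # last row: the i+1 neighbour wraps around to row 1
    @inbounds for j = 1:Ny-1
        res -= spin_conf[Nx, j] * (spin_conf[1, j] + spin_conf[Nx, j+1])
    end
    # last column: the j+1 neighbour wraps around to column 1
    @inbounds for i = 1:Nx-1
        res -= spin_conf[i, Ny] * (spin_conf[i+1, Ny] + spin_conf[i, 1])
    end
    # corner (Nx, Ny): both neighbours wrap
    @inbounds res -= spin_conf[Nx, Ny] * (spin_conf[1, Ny] + spin_conf[Nx, 1])
    return res
end

The bulk loop is then a plain dense stencil, which @avx (or @simd) should have no trouble with.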
Thanks for the hints. Perhaps a better formulation of the benchmark challenge 😄 : https://discourse.julialang.org/t/performance-optimisation-julia-vs-c/59689
I got into a somewhat similar situation with x % UInt8 when x is an Int8. I got this error message:
ERROR: MethodError: no method matching vrem_fast(::VectorizationBase.Vec{32, Int8}, ::Type{UInt8})
Stacktrace:
[1] rem_fast(v::VectorizationBase.Vec{32, Int8}, #unused#::Type{UInt8})
I guess this is a missing method, but when I extended vrem_fast as you suggested upthread, @chriselrod, I got an ERROR: InexactError: trunc(UInt8, 256) instead. Without @avx it works fine.
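For context, a scalar illustration of my own (separate from the MWE): % with a type on the right-hand side is a wrapping, modular conversion that should never throw, unlike a checked conversion.

julia> Int8(-1) % UInt8
0xff

julia> UInt8(256)
ERROR: InexactError: trunc(UInt8, 256)

So the InexactError suggests that something in the vectorized path appears to be doing a checked conversion where a wrapping one was intended.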
MWE:
Trying to run energy(spin_conf) I get

"Manually inlining" the get function:

I obtain