You should use SLEEFPirates to get the SIMD exp function, or you can use SIMDPirates.vexp, which is a different implementation (but normally slower, I've found).
SIMD.jl is cheating this benchmark. It uses @llvm.exp, which LLVM knows it can hoist out of loops, etc., given constant arguments.
using SIMDPirates, SIMD, BenchmarkTools, SLEEFPirates, Test
@fastmath function softmin_julia(a::T, b, c, γ) where T
    γ = -γ
    a, b, c = a/γ, b/γ, c/γ
    maxv = max(a, b, c)
    ae, be, ce = exp(a - maxv), exp(b - maxv), exp(c - maxv)
    γ*(log(ae + be + ce) + maxv)
end
function softmin_simdp(a::T, b, c, γ) where T
    γ = -γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v ./ γ
    maxv = maximum(v)
    ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))
    γ*(log(sum(ve)) + maxv)
end
function softmin_simd(a::T, b, c, γ) where T
    γ = -γ
    v = SIMD.Vec{4,T}((a, b, c, zero(T)))
    v = v / γ
    maxv = maximum(v)
    ve = exp(v - maxv) * SIMD.Vec{4,T}((one(T), one(T), one(T), zero(T)))
    γ*(log(sum(ve)) + maxv)
end
@test softmin_julia(1.0, 1.0, 1.0, 1.0) ≈ softmin_simd(1.0, 1.0, 1.0, 1.0) ≈ softmin_simdp(1.0, 1.0, 1.0, 1.0) # Passes
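Concretely, the SLEEFPirates suggestion above would look something like this applied to softmin_simdp (a sketch only: the name softmin_sleef is illustrative, and it assumes SLEEFPirates.exp accepts this SVec type, as the later benchmarks in this thread do):
function softmin_sleef(a::T, b, c, γ) where T
    γ = -γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v ./ γ
    maxv = maximum(v)
    # identical to softmin_simdp, except the exp call goes through SLEEFPirates explicitly
    ve = SLEEFPirates.exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))
    γ*(log(sum(ve)) + maxv)
end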
Reproducing your results:
julia> @btime softmin_julia(1.0, 1.0, 1.0, 1.0)
6.987 ns (0 allocations: 0 bytes)
-0.09861228866810956
julia> @btime softmin_simd(1.0, 1.0, 1.0, 1.0)
5.027 ns (0 allocations: 0 bytes)
-0.09861228866810973
julia> @btime softmin_simdp(1.0, 1.0, 1.0, 1.0)
18.519 ns (0 allocations: 0 bytes)
-0.09861228866810992
Making it harder to cheat the benchmark:
julia> @btime softmin_julia($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
18.583 ns (0 allocations: 0 bytes)
-0.09861228866810956
julia> @btime softmin_simd($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
32.469 ns (0 allocations: 0 bytes)
-0.09861228866810973
julia> @btime softmin_simdp($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
18.559 ns (0 allocations: 0 bytes)
-0.09861228866810992
You could also add @noinline annotations:
julia> @fastmath @noinline function softmin_julia_noinline(a::T, b, c, γ) where T
           γ = -γ
           a, b, c = a/γ, b/γ, c/γ
           maxv = max(a, b, c)
           ae, be, ce = exp(a - maxv), exp(b - maxv), exp(c - maxv)
           γ*(log(ae + be + ce) + maxv)
       end
softmin_julia_noinline (generic function with 1 method)
julia> @noinline function softmin_simdp_noinline(a::T, b, c, γ) where T
           γ = -γ
           v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
           v = v ./ γ
           maxv = maximum(v)
           ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))
           γ*(log(sum(ve)) + maxv)
       end
softmin_simdp_noinline (generic function with 1 method)
julia> @noinline function softmin_simd_noinline(a::T, b, c, γ) where T
           γ = -γ
           v = SIMD.Vec{4,T}((a, b, c, zero(T)))
           v = v / γ
           maxv = maximum(v)
           ve = exp(v - maxv) * SIMD.Vec{4,T}((one(T), one(T), one(T), zero(T)))
           γ*(log(sum(ve)) + maxv)
       end
softmin_simd_noinline (generic function with 1 method)
julia> @btime softmin_julia_noinline(1.0, 1.0, 1.0, 1.0)
18.607 ns (0 allocations: 0 bytes)
-0.09861228866810956
julia> @btime softmin_simd_noinline(1.0, 1.0, 1.0, 1.0)
33.525 ns (0 allocations: 0 bytes)
-0.09861228866810973
julia> @btime softmin_simdp_noinline(1.0, 1.0, 1.0, 1.0)
18.521 ns (0 allocations: 0 bytes)
-0.09861228866810992
It still isn't really faster than the base Julia version. That's probably because of how many scalar operations you have, and the constant transitions back and forth between scalars and vectors. Those transitions are not cheap.
julia> sx8 = SVec(ntuple(Val(8)) do _ Core.VecElement(rand()) end)
SVec{8,Float64}<0.9624197553870346, 0.02572482796496467, 0.24800450254630002, 0.10130370440652925, 0.4164007115948414, 0.3445471696363833, 0.376804748601024, 0.7396548714084865>
julia> sx4 = SVec(ntuple(Val(8)) do _ Core.VecElement(rand()) end)
SVec{8,Float64}<0.11491563624476475, 0.20257085040810008, 0.8663878447778868, 0.6559109728401165, 0.09552476759038497, 0.4653985642583569, 0.7181069232425974, 0.35771529752902187>
julia> @btime exp($sx8)
4.438 ns (0 allocations: 0 bytes)
SVec{8,Float64}<2.618023792228547, 1.026058566999535, 1.2814657020396871, 1.1066126738790019, 1.5164934236336711, 1.4113506725050988, 1.4576196793033618, 2.095212272035349>
julia> @btime exp($sx4)
4.440 ns (0 allocations: 0 bytes)
SVec{8,Float64}<1.121778796108057, 1.2245468416919811, 2.378304514094675, 1.926897068976135, 1.1002360718928326, 1.592648835306877, 2.050547689838137, 1.4300584213511476>
The exp itself is very fast.
Something else you can try:
julia> a = rand(100); b = rand(100); c = rand(100); γ = rand(100); x = similar(a);
julia> @btime @. $x = softmin_julia($a, $b, $c, $γ);
3.495 μs (0 allocations: 0 bytes)
julia> @btime @. $x = softmin_simd($a, $b, $c, $γ);
3.693 μs (0 allocations: 0 bytes)
julia> @btime @. $x = softmin_simdp($a, $b, $c, $γ);
2.123 μs (0 allocations: 0 bytes)
Thanks for the pointers! The SIMDPirates version ended up being a few percent faster than native Julia in the application I had in mind; it appears in this loop (dynamic programming for dynamic time warping):
for c = 2:n
    for r = 2:m
        D[r, c] += softmin(transportcost*D[r-1, c], D[r-1, c-1], transportcost*D[r, c-1], γ)
    end
end
so it's unfortunately hard to apply SIMD at any level other than inside the function itself. I already have LoopVectorization as a dependency, so I might as well use the slightly faster version, even if it's not a huge speedup :)
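For context, here is a rough sketch of how that recurrence typically sits inside a full cost function; the pairwise-distance matrix and border initialization below are assumptions for illustration, not the actual soft_dtw_cost used for the timings later in the thread:
# Hypothetical skeleton; `softmin` can be any of the variants defined above.
function soft_dtw_cost_sketch(a, b, γ; transportcost = 1.0, softmin = softmin_julia)
    m, n = length(a), length(b)
    D = [abs2(a[r] - b[c]) for r in 1:m, c in 1:n]  # pairwise squared distances
    for r in 2:m                                    # accumulate along the first column
        D[r, 1] += D[r-1, 1]
    end
    for c in 2:n                                    # accumulate along the first row
        D[1, c] += D[1, c-1]
    end
    for c in 2:n, r in 2:m                          # the loop from the comment above
        D[r, c] += softmin(transportcost*D[r-1, c], D[r-1, c-1], transportcost*D[r, c-1], γ)
    end
    D[m, n]
end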
Actually, it turns out that the SIMDPirates version was much more prone to numerical overflow for small γ than the native Julia version, ~which surprises me since the maximum value is subtracted before taking the exp~. It is of course because my initialization of the vector, v = SIMDPirates.SVec{4,T}((a, b, c, zero(T))), was a mistake: the zero element becomes the largest after dividing by the negative γ and ends up being the maximum, so effectively no normalization was done.
max(a,b,c) should be slightly faster than maximum(v) anyway, so you could try something like this:
function softmin_simdp(a::T, b, c, γ) where T
    γ = -γ
    ninvγ = one(T) / γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v * ninvγ
    maxv = Base.FastMath.max_fast(a, b, c) * ninvγ
    ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))
    γ*(log(sum(ve)) + maxv)
end
Base.FastMath.max_fast is fast as promised, but won't behave correctly in the presence of NaNs.
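A quick illustration of that caveat (the exact value max_fast returns for NaN inputs is not something to rely on; it depends on argument order and how the comparison lowers):
max(NaN, 1.0)                     # Base.max propagates NaN, so bad data stays visible
# max_fast compiles to a plain compare-and-select, so a NaN argument
# can be silently dropped:
Base.FastMath.max_fast(NaN, 1.0)  # not guaranteed to be NaN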
Is it still more prone to numerical overflow? If so, what platform are you on? I have three SIMD exp implementations, and it would be nice to know how they perform in that regard.
Base.exp may use GLIBC's libmvec on Linux platforms that have it.
Your latest suggestion was indeed some 25% faster. I get the following timings:
a = randn(128)
b = randn(128)
@btime soft_dtw_cost($a,$b)
1.392 ms (2 allocations: 128.08 KiB) # Julia native
1.199 ms (2 allocations: 128.08 KiB) # initial softmin_simdp
880.386 μs (2 allocations: 128.08 KiB) # Latest softmin_simdp
905.173 μs (2 allocations: 128.08 KiB) # latest with vexp
903.400 μs (2 allocations: 128.08 KiB) # latest with SLEEFPirates.exp
# With Float32
1.082 ms (2 allocations: 64.08 KiB) # Julia native
814.215 μs (2 allocations: 64.08 KiB) # latest
MethodError: no method matching vexp(::NTuple{4,VecElement{Float32}}) # latest with vexp
NaN result (but very fast, 450.398 μs) # latest with SLEEFPirates.exp
I should mention that I am running this on a Core i5-2500K without AVX2.
The numerical overflow problems were due to my own error; they are not a problem anymore. ~The SLEEFPirates version did result in NaN with Float32 though~ (see below).
Re-running the benchmarks on a newer CPU with AVX2 yields the same relative results. Once again, the NaN was the result of initializing with
SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
instead of
SIMDPirates.SVec{4,T}((a, b, c, typemax(T)))
With the correct initialization, SLEEFPirates does not suffer from the NaN result. ~but it is not faster than the fastest one so far~ With the correct initialization, and running on the newer CPU, SLEEFPirates.exp is indeed a bit faster for both Float64 and Float32:
# Float32
julia> @btime soft_dtw_cost($a,$b) # SLEEFPirates.exp
741.724 μs (2 allocations: 64.08 KiB)
-6.6161013f0
julia> @btime soft_dtw_cost($a,$b) # Regular exp
870.516 μs (2 allocations: 64.08 KiB)
-6.6161013f0
# Float64
julia> @btime soft_dtw_cost($a,$b) # regular exp
783.779 μs (2 allocations: 128.08 KiB)
-27.517397953355893
julia> @btime soft_dtw_cost($a,$b) # SLEEFPirates.exp
716.770 μs (2 allocations: 128.08 KiB)
-27.51739795335587
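For completeness, here is a sketch of what the corrected version looks like with the typemax(T) lane, combining the max_fast variant from earlier with SLEEFPirates.exp (an illustration, not the exact code behind the timings above):
function softmin_simdp_fixed(a::T, b, c, γ) where T
    γ = -γ
    ninvγ = one(T) / γ
    # typemax(T) becomes a hugely negative number after multiplying by the negative 1/γ,
    # so its exp underflows to 0. With zero(T), that lane was exp(-maxv), which can
    # overflow to Inf, and Inf * 0 in the masking step produced the NaN.
    v = SIMDPirates.SVec{4,T}((a, b, c, typemax(T)))
    v = v * ninvγ
    maxv = Base.FastMath.max_fast(a, b, c) * ninvγ
    ve = SLEEFPirates.exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))
    γ*(log(sum(ve)) + maxv)
end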
Are you on Linux? Base.exp(::SVec{2}) and Base.exp(::SVec{4}) may be using GLIBC's implementation if you are. Otherwise, it'll either be using SIMDPirates.vexp, or this implementation, which is less specific than the GLIBC definitions and would thus be shadowed by them.
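If you want to check which method a given session actually picks up, something like this should show it (reusing the sx8 vector from the earlier snippet):
using InteractiveUtils  # provides @which outside the REPL
@which exp(sx8)         # prints the exp method (and source file) that would be dispatched to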
> I should mention that I am running this on a Core i5-2500K without AVX2.
Ah, yeah, that's a problem. SIMDPirates.vexp tries to use SIMD integer operations, which requires AVX2.
Although the same relative performance (including vs. the base Julia implementation) means that maybe it's not that big a problem after all.
One option you could consider is compiling the real SLEEF to LLVM bitcode, and then wrapping the functions you want with Base.llvmcall.
I'd accept a PR.
I did that for a few functions here, though I should probably add proper attribution there.
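A minimal sketch of the Base.llvmcall mechanism itself, calling an LLVM intrinsic inline rather than an actual SLEEF function compiled to bitcode (a real SLEEF wrapper would additionally need the bitcode linked in; the name llvm_fma is just for illustration):
# Hypothetical example: wrap @llvm.fma.f64 via llvmcall.
# A SLEEF-bitcode wrapper would look similar, but call the SLEEF symbol instead.
function llvm_fma(a::Float64, b::Float64, c::Float64)
    Base.llvmcall(
        ("declare double @llvm.fma.f64(double, double, double)",
         """
         %r = call double @llvm.fma.f64(double %0, double %1, double %2)
         ret double %r
         """),
        Float64, Tuple{Float64,Float64,Float64}, a, b, c)
end

llvm_fma(2.0, 3.0, 1.0)  # 7.0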
Looking at the source, I also see that they use @llvm.fma instead of @llvm.muladd (on top of using SIMD integer operations), which would lead to poor performance given that you don't have native FMA.
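The practical difference, for anyone reading along: fma demands a single correctly rounded fused operation, so without hardware FMA it falls back to a slow software routine, whereas muladd lets the compiler pick whichever of a*b + c or a fused multiply-add is fastest on the target:
a, b, c = 1.0 + 2.0^-27, 1.0 - 2.0^-27, -1.0

muladd(a, b, c)  # may be 0.0 or -2^-54, depending on whether the compiler fuses; always fast
fma(a, b, c)     # always the exactly rounded -2^-54; software fallback without native FMA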
Yeah, I'm on Ubuntu. I was mistaken in my benchmarking on the faster CPU and have updated my post above: SLEEFPirates.exp is indeed a bit faster on the newer CPU. Interestingly, it's faster for Float64 than for Float32.
Please excuse me if I'm using this package in the wrong way. I'm optimizing a softmin implementation and tried using @avx. The @avx version fails with https://github.com/chriselrod/LoopVectorization.jl/issues/119, but the other ones give the same results. SIMDPirates appears to produce much slower code than native Julia in this case, and I wonder if I'm using it incorrectly. SIMD.jl is quite a bit faster than native Julia. Adding @fastmath annotations to the two SIMD alternatives makes them both slower.