chriselrod / SIMDPirates.jl

Pirating base type NTuple{N,Core.VecElement{T}} -- although base methods are NOT overloaded -- and plundering the great work of eschnett's SIMD.jl

Failed optimization? #10

Closed (baggepinnen closed this issue 4 years ago)

baggepinnen commented 4 years ago

Please excuse me if I'm using this package in the wrong way. I'm optimizing a softmin implementation and tried the variants below.
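For reference, all of the variants compute the log-sum-exp smoothed minimum

softmin_γ(a, b, c) = -γ * log(exp(-a/γ) + exp(-b/γ) + exp(-c/γ)),

subtracting the largest exponent before exponentiating, as usual, to avoid overflow.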

The @avx version fails with https://github.com/chriselrod/LoopVectorization.jl/issues/119, but the other ones give the same results. SIMDPirates appears to produce much slower code than native Julia in this case, and I wonder if I'm using it incorrectly. SIMD.jl is quite a bit faster than native Julia.

@fastmath function softmin_julia(a::T,b,c, γ) where T
    γ = -γ
    a,b,c = a/γ, b/γ, c/γ
    maxv = max(a,b,c)
    ae,be,ce = exp(a - maxv), exp(b - maxv), exp(c - maxv)
    γ*(log(ae+be+ce) + maxv)
end

using SIMDPirates
function softmin_simdp(a::T,b,c, γ) where T
    γ = -γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v ./ γ
    maxv = maximum(v)
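    # zero out the padding lane so its exp does not contribute to the sum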
    ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))

    γ*(log(sum(ve)) + maxv)
end

using SIMD
function softmin_simd(a::T,b,c, γ) where T
    γ = -γ
    v = SIMD.Vec{4,T}((a, b, c, zero(T)))
    v = v / γ
    maxv = maximum(v)
    ve = exp(v - maxv) * SIMD.Vec{4,T}((one(T), one(T), one(T), zero(T)))

    γ*(log(sum(ve)) + maxv)
end

using Test, BenchmarkTools

@test softmin_julia(1.0, 1.0, 1.0, 1.0) ≈ softmin_simd(1.0, 1.0, 1.0, 1.0) ≈ softmin_simdp(1.0, 1.0, 1.0, 1.0) # Passes

julia> @btime softmin_julia(1.0, 1.0, 1.0, 1.0)
  14.468 ns (0 allocations: 0 bytes)
-0.09861228866810956

julia> @btime softmin_simd(1.0, 1.0, 1.0, 1.0)
  8.609 ns (0 allocations: 0 bytes)
-0.09861228866810973

julia> @btime softmin_simdp(1.0, 1.0, 1.0, 1.0)
  48.147 ns (0 allocations: 0 bytes)
-0.09861228866810973

Adding @fastmath annotations to the two SIMD alternatives makes them both slower.

chriselrod commented 4 years ago

You should use SLEEFPirates to get the SIMD exp function, or you can use SIMDPirates.vexp, which is a different implementation (but normally slower, I've found).

SIMD.jl is cheating in this benchmark: it uses the @llvm.exp intrinsic, which LLVM knows it can constant-fold and hoist out of loops when the arguments are compile-time constants.
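One way to see this for yourself (a sketch; it assumes softmin_simd inlines into the wrapper, which it should at this size):

using InteractiveUtils # for @code_llvm; loaded automatically in the REPL

# @btime with literal arguments effectively measures a zero-argument
# closure over constants, like this wrapper. With the arguments visible
# as compile-time constants, LLVM is free to fold or hoist the @llvm.exp
# calls, which is exactly the "cheating" described above:
cheat() = softmin_simd(1.0, 1.0, 1.0, 1.0)
@code_llvm cheat()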

using SIMDPirates, SIMD, BenchmarkTools, SLEEFPirates, Test

@fastmath function softmin_julia(a::T,b,c, γ) where T
    γ = -γ
    a,b,c = a/γ, b/γ, c/γ
    maxv = max(a,b,c)
    ae,be,ce = exp(a - maxv), exp(b - maxv), exp(c - maxv)
    γ*(log(ae+be+ce) + maxv)
end

function softmin_simdp(a::T,b,c, γ) where T
    γ = -γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v ./ γ
    maxv = maximum(v)
    ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))

    γ*(log(sum(ve)) + maxv)
end

function softmin_simd(a::T,b,c, γ) where T
    γ = -γ
    v = SIMD.Vec{4,T}((a, b, c, zero(T)))
    v = v / γ
    maxv = maximum(v)
    ve = exp(v - maxv) * SIMD.Vec{4,T}((one(T), one(T), one(T), zero(T)))

    γ*(log(sum(ve)) + maxv)
end

@test softmin_julia(1.0, 1.0, 1.0, 1.0) ≈ softmin_simd(1.0, 1.0, 1.0, 1.0) ≈ softmin_simdp(1.0, 1.0, 1.0, 1.0) # Passes

Reproducing your results:

julia> @btime softmin_julia(1.0, 1.0, 1.0, 1.0)
  6.987 ns (0 allocations: 0 bytes)
-0.09861228866810956

julia> @btime softmin_simd(1.0, 1.0, 1.0, 1.0)
  5.027 ns (0 allocations: 0 bytes)
-0.09861228866810973

julia> @btime softmin_simdp(1.0, 1.0, 1.0, 1.0)
  18.519 ns (0 allocations: 0 bytes)
-0.09861228866810992

Making it harder to cheat the benchmark (interpolating Refs and dereferencing them hides the values from the compiler, so the arguments can no longer be treated as constants):

julia> @btime softmin_julia($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
  18.583 ns (0 allocations: 0 bytes)
-0.09861228866810956

julia> @btime softmin_simd($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
  32.469 ns (0 allocations: 0 bytes)
-0.09861228866810973

julia> @btime softmin_simdp($(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[], $(Ref(1.0))[])
  18.559 ns (0 allocations: 0 bytes)
-0.09861228866810992

You could also add @noinline annotations:

julia> @fastmath @noinline function softmin_julia_noinline(a::T,b,c, γ) where T
           γ = -γ
           a,b,c = a/γ, b/γ, c/γ
           maxv = max(a,b,c)
           ae,be,ce = exp(a - maxv), exp(b - maxv), exp(c - maxv)
           γ*(log(ae+be+ce) + maxv)
       end
softmin_julia_noinline (generic function with 1 method)

julia> @noinline function softmin_simdp_noinline(a::T,b,c, γ) where T
           γ = -γ
           v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
           v = v ./ γ
           maxv = maximum(v)
           ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))

           γ*(log(sum(ve)) + maxv)
       end
softmin_simdp_noinline (generic function with 1 method)

julia> @noinline function softmin_simd_noinline(a::T,b,c, γ) where T
           γ = -γ
           v = SIMD.Vec{4,T}((a, b, c, zero(T)))
           v = v / γ
           maxv = maximum(v)
           ve = exp(v - maxv) * SIMD.Vec{4,T}((one(T), one(T), one(T), zero(T)))

           γ*(log(sum(ve)) + maxv)
       end
softmin_simd_noinline (generic function with 1 method)

julia> @btime softmin_julia_noinline(1.0, 1.0, 1.0, 1.0)
  18.607 ns (0 allocations: 0 bytes)
-0.09861228866810956

julia> @btime softmin_simd_noinline(1.0, 1.0, 1.0, 1.0)
  33.525 ns (0 allocations: 0 bytes)
-0.09861228866810973

julia> @btime softmin_simdp_noinline(1.0, 1.0, 1.0, 1.0)
  18.521 ns (0 allocations: 0 bytes)
-0.09861228866810992

It still isn't really faster than the base Julia version. That's probably because of how many scalar operations you have, and the constant transitions back and forth between scalars and vectors. Those transitions are not cheap.

julia> sx8 = SVec(ntuple(Val(8)) do _ Core.VecElement(rand()) end)
SVec{8,Float64}<0.9624197553870346, 0.02572482796496467, 0.24800450254630002, 0.10130370440652925, 0.4164007115948414, 0.3445471696363833, 0.376804748601024, 0.7396548714084865>

julia> sx4 = SVec(ntuple(Val(8)) do _ Core.VecElement(rand()) end)
SVec{8,Float64}<0.11491563624476475, 0.20257085040810008, 0.8663878447778868, 0.6559109728401165, 0.09552476759038497, 0.4653985642583569, 0.7181069232425974, 0.35771529752902187>

julia> @btime exp($sx8)
  4.438 ns (0 allocations: 0 bytes)
SVec{8,Float64}<2.618023792228547, 1.026058566999535, 1.2814657020396871, 1.1066126738790019, 1.5164934236336711, 1.4113506725050988, 1.4576196793033618, 2.095212272035349>

julia> @btime exp($sx4)
  4.440 ns (0 allocations: 0 bytes)
SVec{8,Float64}<1.121778796108057, 1.2245468416919811, 2.378304514094675, 1.926897068976135, 1.1002360718928326, 1.592648835306877, 2.050547689838137, 1.4300584213511476>

The exp itself is very fast.

chriselrod commented 4 years ago

Something else you can try:

julia> a = rand(100); b = rand(100); c = rand(100); γ = rand(100); x = similar(a);

julia> @btime @. $x = softmin_julia($a, $b, $c, $γ);
  3.495 μs (0 allocations: 0 bytes)

julia> @btime @. $x = softmin_simd($a, $b, $c, $γ);
  3.693 μs (0 allocations: 0 bytes)

julia> @btime @. $x = softmin_simdp($a, $b, $c, $γ);
  2.123 μs (0 allocations: 0 bytes)
baggepinnen commented 4 years ago

Thanks for the pointers! The SIMDPirates version ended up being a few percent faster than native Julia in the application I had in mind; it appears in this loop (dynamic programming for dynamic time warping):

for c = 2:n
    for r = 2:m
        D[r, c] += softmin(transportcost*D[r-1, c], D[r-1, c-1], transportcost*D[r, c-1], γ)
    end
end

so it's unfortunately hard to apply SIMD at any level other than inside the function itself. I already have LoopVectorization as a dependency, so I might as well use the slightly faster version, even if it's not a huge speedup :)
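For context, here is a self-contained sketch of the surrounding soft_dtw_cost (benchmarked below); the inner recurrence is the loop above, but the local-cost and boundary initialization are assumptions rather than the actual implementation:

# softmin is any of the softmin_* variants defined earlier in this thread
const softmin = softmin_julia

function soft_dtw_cost(a::AbstractVector{T}, b::AbstractVector{T};
                       γ::T = T(1), transportcost::T = T(1)) where T
    m, n = length(a), length(b)
    D = [abs2(a[r] - b[c]) for r in 1:m, c in 1:n] # local cost matrix (assumed)
    for r = 2:m # accumulate along the borders (assumed boundary conditions)
        D[r, 1] += D[r-1, 1]
    end
    for c = 2:n
        D[1, c] += D[1, c-1]
    end
    for c = 2:n, r = 2:m # the recurrence from the loop above
        D[r, c] += softmin(transportcost*D[r-1, c], D[r-1, c-1],
                           transportcost*D[r, c-1], γ)
    end
    D[m, n]
end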

baggepinnen commented 4 years ago

Actually, it turns out that the SIMDPirates version was much more prone to numerical overflow for small γ than the native Julia version, ~which surprises me since the maximum value is subtracted before taking the exp~ It's of course because my initialization of the vector

v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))

was flawed: the 0 element becomes the largest after dividing by the negative γ and ends up being the maximum, so effectively no normalization was done.
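A scalar sketch of the failure mode, with exaggerated values:

a = b = c = 100.0; γ = 0.01
v = (a, b, c, 0.0) ./ (-γ) # (-10000.0, -10000.0, -10000.0, -0.0)
maxv = maximum(v)          # -0.0: the padding lane wins the maximum
exp.(v .- maxv)            # the real lanes underflow to exactly 0.0, so the
                           # masked sum is 0 and log(sum(ve)) blows up to -Inf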

chriselrod commented 4 years ago

max(a,b,c) should be slightly faster than maximum(v) anyway, so you could try something like this:

function softmin_simdp(a::T,b,c, γ) where T
    γ = -γ
    ninvγ = one(T) / γ
    v = SIMDPirates.SVec{4,T}((a, b, c, zero(T)))
    v = v * ninvγ
    maxv = Base.FastMath.max_fast(a,b,c) * ninvγ
    ve = exp(v - maxv) * SIMDPirates.SVec{4,T}((one(T), one(T), one(T), zero(T)))

    γ*(log(sum(ve)) + maxv)
end

Base.FastMath.max_fast is fast as promised, but won't behave correctly in the presence of NaNs.
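A quick illustration (which argument order drops the NaN can vary across Julia versions, so treat the second line as a sketch):

max(NaN, 1.0) # NaN: Base.max is defined to propagate NaNs
# max_fast compiles down to a bare floating-point comparison, so the NaN
# can slip through for one of the two argument orders:
Base.FastMath.max_fast(NaN, 1.0), Base.FastMath.max_fast(1.0, NaN)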

Is it still more prone to numerical overflow? If so, what platform are you on? I have three SIMD exp implementations, and it would be nice to know how they perform in that regard (a quick timing sketch follows the list):

  1. SIMDPirates.vexp
  2. SLEEFPirates.exp defined as Julia source
  3. SLEEFPirates.exp, wrapping GLIBC's libmvec on Linux platforms that have it.
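Something like this would time the first two on a given machine (a sketch reusing the SVec constructor from earlier in this thread; SLEEFPirates.exp resolves to 2 or 3 depending on the platform):

using SIMDPirates, SLEEFPirates, BenchmarkTools

x = SVec(ntuple(_ -> Core.VecElement(rand()), Val(4)))

@btime SIMDPirates.vexp($x)
@btime SLEEFPirates.exp($x)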
baggepinnen commented 4 years ago

Your latest suggestion was indeed some 25% faster. I get the following timings:

a = randn(128)
b = randn(128)
@btime soft_dtw_cost($a,$b)
1.392 ms   (2 allocations: 128.08 KiB) # Julia native
1.199 ms   (2 allocations: 128.08 KiB) # initial softmin_simdp
880.386 μs (2 allocations: 128.08 KiB) # Latest softmin_simdp
905.173 μs (2 allocations: 128.08 KiB) # latest with vexp
903.400 μs (2 allocations: 128.08 KiB) # latest with SLEEFPirates.exp

# With Float32
1.082 ms (2 allocations: 64.08 KiB) # Julia native
814.215 μs (2 allocations: 64.08 KiB) # latest
MethodError: no method matching vexp(::NTuple{4,VecElement{Float32}}) # latest with vexp
NaN result (but very fast: 450.398 μs) # latest with SLEEFPirates.exp

I should mention that I am running this on a Core i5-2500K, which lacks AVX2.

The numerical overflow problems were due to my own error; they are not a problem anymore. ~The SLEEFPirates version did result in NaN with Float32 though.~ See below.

baggepinnen commented 4 years ago

Re-running the benchmarks on a newer CPU with AVX2 yields the same relative results. Once again, the NaN was the result of

SIMDPirates.SVec{4,T}((a, b, c, zero(T)))

instead of

SIMDPirates.SVec{4,T}((a, b, c, typemax(T)))

With the correct initialization, SLEEFPirates does not suffer from the NaN result, ~but it is not faster than the fastest one so far~ With the correct init and running on the newer CPU, SLEEFPirates.exp is indeed a bit faster for both Float64 and Float32. (With typemax(T), the padding lane scales to -Inf, so its exp is exactly zero and it can never dominate the maximum.)

# Float32 
julia> @btime soft_dtw_cost($a,$b) # SLEEFPirates.exp
  741.724 μs (2 allocations: 64.08 KiB)
-6.6161013f0

julia> @btime soft_dtw_cost($a,$b) # Regular exp
  870.516 μs (2 allocations: 64.08 KiB)
-6.6161013f0

# Float64
julia> @btime soft_dtw_cost($a,$b) # regular exp
  783.779 μs (2 allocations: 128.08 KiB)
-27.517397953355893

julia> @btime soft_dtw_cost($a,$b) # SLEEFPirates.exp
  716.770 μs (2 allocations: 128.08 KiB)
-27.51739795335587
chriselrod commented 4 years ago

Are you on Linux? Base.exp(::SVec{2}) and Base.exp(::SVec{4}) may be using GLIBC's implementation if you are. Otherwise, it'll either be using SIMDPirates.vexp, or this implementation, which is less specific than the GLIBC definitions and would thus be shadowed by them.
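If you want to check which method you're actually hitting, @which will tell you (a sketch; use the same SVec width as your code):

using SIMDPirates, SLEEFPirates, InteractiveUtils

v = SVec(ntuple(_ -> Core.VecElement(rand()), Val(4)))
@which exp(v) # shows whether the GLIBC wrapper or the generic Julia-source definition applies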

I should mention that I am running this on a Core i5-2500K, which lacks AVX2.

Ah, yeah, that's a problem. SIMDPirates.vexp tries to use SIMD integer operations, which require AVX2. Although, the same relative performance (including vs the base Julia implementation) suggests that maybe it's not that big a problem after all.

One option you could consider is compiling the real SLEEF to LLVM bitcode and then wrapping the functions you want with Base.llvmcall; I'd accept a PR. I did that for a few functions here (I still need to add proper attribution there). Looking at the source, I also see that they use @llvm.fma instead of @llvm.muladd (on top of using SIMD integer operations), which would lead to poor performance given that you don't have native FMA.
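For reference, the general shape of a Base.llvmcall wrapper (a toy example around a single intrinsic, not SLEEF bitcode, but the calling convention is the same):

# Two-string form: (declarations, body); arguments are %0, %1, ...
llvm_sqrt(x::Float64) = Base.llvmcall(
    ("declare double @llvm.sqrt.f64(double)",
     """
     %res = call double @llvm.sqrt.f64(double %0)
     ret double %res
     """),
    Float64, Tuple{Float64}, x)

llvm_sqrt(2.0) # ≈ 1.4142135623730951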

baggepinnen commented 4 years ago

Yeah, I'm on Ubuntu. I was mistaken in my benchmarking on the faster CPU and have updated my post above: SLEEFPirates.exp is indeed a bit faster on the newer CPU. Interestingly, it's faster for Float64 than for Float32.