JuliaSIMD / Polyester.jl

The cheapest threads you can find!
MIT License

Fix `threadlocal` type-inferrability and add `reduction` functionality #134

Closed GianlucaFuwa closed 6 months ago

GianlucaFuwa commented 6 months ago

Fixes https://github.com/JuliaSIMD/Polyester.jl/issues/112 by passing the threadlocal accumulator as an argument to the batch call instead of capturing it.
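For context, a minimal sketch (my own illustration, not Polyester's code) of why capturing a reassigned accumulator hurts inference: Julia boxes a variable that is both captured by a closure and reassigned, so its type can no longer be inferred concretely, while an accumulator passed as an argument stays concretely typed. The names `captured_sum` and `passed_sum` here are hypothetical:

```julia
# Hypothetical illustration, not Polyester internals.
# `captured_sum` closes over `acc` and reassigns it, so Julia stores it in a
# Core.Box and the accumulator's type is no longer concretely inferred.
function captured_sum(xs)
    acc = 0
    add! = x -> (acc += x)   # captured + reassigned => boxed
    foreach(add!, xs)
    acc
end

# `passed_sum` threads the accumulator through as a plain argument instead,
# so it keeps its concrete type and infers cleanly.
passed_step(acc, x) = acc + x
passed_sum(xs) = foldl(passed_step, xs; init = 0)
```

Both compute the same result; only the second is type-stable.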

I found that adding a more restrictive, OpenMP-like reduction flag was more elegant than checking whether threadlocal is isbits in order to eliminate allocations in such cases. Usage looks like:

red1 = 0
red2 = 0
red3 = 0
@batch reduction = ((+,red1), (+,red2), (+,red3)) for i = 0:9
  red1 += 1
  red2 += 1
  red3 -= 1
end
red1, red2, red3

It works by allocating the initial value (the neutral element of the operation) within each thread's function and loading + reducing those values at the end. This PR also includes the tid fix from https://github.com/JuliaSIMD/Polyester.jl/pull/131/commits/faab451e419553294ccab1562eafa519c90ab256, fixes to the documentation, and additional tests for the reduction functionality.
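The strategy can be sketched roughly as follows with a hand-rolled example using Base.Threads (the helper name `sum_reduction_sketch` and the chunking scheme are illustrative assumptions; Polyester's actual implementation differs):

```julia
using Base.Threads

# Hand-rolled sketch of the reduction strategy: each task starts from the
# neutral element of the operator (zero for +), accumulates into a task-local
# variable, and the per-task partial results are loaded and reduced at the end.
function sum_reduction_sketch(x)
    nt = nthreads()
    partials = zeros(eltype(x), nt)
    @threads for t in 1:nt
        lo = 1 + (t - 1) * length(x) ÷ nt   # this task's chunk of the range
        hi = t * length(x) ÷ nt
        acc = zero(eltype(x))               # neutral element, task-local
        for i in lo:hi
            acc += x[i]
        end
        partials[t] = acc                   # store the partial result
    end
    sum(partials)                           # final load + reduce
end
```

Because each task only touches its own slot of `partials`, no locks or atomics are needed for the accumulation itself.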

Krastanov commented 6 months ago

It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

GianlucaFuwa commented 6 months ago

> It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

I might be misunderstanding, but I can't find any benchmarks using threadlocal that show allocations in the README. I could add something like:

julia> function batch_reduction()
           y1 = 0
           y2 = 1
           @batch reduction=((+, y1), (*, y2)) for i in 1:9
               y1 += i
               y2 *= i
           end
           y1, y2
       end
julia> batch_reduction()
(45, 362880)
julia> @allocated batch_reduction()
0

to show that reduction doesn't allocate though.

I also forgot to show how this PR affects the performance of threadlocal:

using Test, BenchmarkTools

function f()
    @batch threadlocal=rand(10:99)::Int for i in 0:9
        threadlocal += 1
    end
    sum(threadlocal)
end

try
    @inferred f()
catch e
    println("Not inferred")
end
@benchmark f()

Old:

Not inferred
BenchmarkTools.Trial: 874 samples with 170 evaluations.
 Range (min … max):  785.882 ns … 188.296 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      27.131 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    33.600 μs ±  25.621 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▃▄▆▇█▇▄▅▂▂▂▂▂  ▁  ▂     ▄   
  ████████████████▇██▇█▅▅▆███▇▇▆█▆▇▆▆▇▆▄▄▅▅▄▄▅▃▃▄▃▃▃▃▁▃▃▂▂▁▃▁▁▃ ▅
  786 ns           Histogram: frequency by time          108 μs <

 Memory estimate: 224 bytes, allocs estimate: 5.

New:

BenchmarkTools.Trial: 10000 samples with 287 evaluations.
 Range (min … max):  282.927 ns … 201.847 μs  ┊ GC (min … max): 0.00% … 2.28%
 Time  (median):     303.833 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   644.395 ns ±   4.433 μs  ┊ GC (mean ± σ):  0.33% ± 0.05%

  █▂▂▁                                                          ▁
  █████▆▆▇█▆▆▆▅▃▅▁▄▄▄▆▄▆▆▅▆▆▆▅▄▃▄▄▃▅▄▄▆▄▄▄▄▅▄▅▃▁▃▁▄▃▁▃▃▃▄▁▁▄▁▄▄ █
  283 ns        Histogram: log(frequency) by time       7.69 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

Krastanov commented 6 months ago

My apologies, the readme has been simplified significantly since the last time I looked. Disregard my comment.

chriselrod commented 6 months ago

Performance seems to be worse than I'd have expected.

julia> using LoopVectorization, BenchmarkTools, Polyester, LinearAlgebra

julia> function dot_batch(x,y)
           s = zero(Base.promote_eltype(x,y))
           @batch reduction=((+,s),) for i = eachindex(x,y)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_batch (generic function with 1 method)

julia> function dot_tturbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @tturbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_tturbo (generic function with 1 method)

julia> function dot_turbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @turbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(16384*8);

julia> @btime dot_turbo($x,$x)
  8.710 μs (0 allocations: 0 bytes)
43620.94200326388

julia> @btime dot_tturbo($x,$x)
  3.097 μs (0 allocations: 0 bytes)
43620.94200326386

julia> @btime dot($x,$x)
  3.174 μs (0 allocations: 0 bytes)
43620.942003263866

julia> @btime dot_batch($x,$x)
  32.269 μs (0 allocations: 0 bytes)
43620.94200326387

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_batch, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.01e+11  100.0%  #  3.5 cycles per ns
└ task-clock               1.42e+11  100.0%  # 141.6 s
┌ instructions             5.06e+11  100.0%  #  1.0 insns per cycle
│ branch-instructions      1.00e+11  100.0%  # 19.8% of insns
└ branch-misses            8.47e+06  100.0%  #  0.0% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  #  8.0% of dcache loads
│ L1-dcache-loads          2.06e+11  100.0%
│ cache-misses             5.86e+05  100.0%  #  2.1% of cache refs
└ cache-references         2.86e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_tturbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.92e+10  100.0%  #  3.6 cycles per ns
└ task-clock               1.08e+10  100.0%  # 10.8 s
┌ instructions             4.57e+10  100.0%  #  1.2 insns per cycle
│ branch-instructions      4.19e+09  100.0%  #  9.2% of insns
└ branch-misses            5.73e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.29e+10  100.0%
│ cache-misses             5.42e+04  100.0%  #  0.3% of cache refs
└ cache-references         1.76e+07  100.0%
                  aggregated from 3 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.04e+10  100.0%  #  3.4 cycles per ns
└ task-clock               1.48e+10  100.0%  # 14.8 s
┌ instructions             6.40e+10  100.0%  #  1.3 insns per cycle
│ branch-instructions      6.45e+09  100.0%  # 10.1% of insns
└ branch-misses            8.16e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.69e+10  100.0%  # 46.5% of dcache loads
│ L1-dcache-loads          3.64e+10  100.0%
│ cache-misses             2.80e+05  100.0%  #  0.5% of cache refs
└ cache-references         6.21e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_turbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.05e+10  100.0%  #  4.1 cycles per ns
└ task-clock               9.81e+09  100.0%  #  9.8 s
┌ instructions             3.90e+10  100.0%  #  1.0 insns per cycle
│ branch-instructions      2.06e+09  100.0%  #  5.3% of insns
└ branch-misses            1.18e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.64e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.28e+10  100.0%
│ cache-misses             3.07e+05  100.0%  #  0.0% of cache refs
└ cache-references         1.86e+09  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

It requires around 10x as many instructions as @tturbo for this example, resulting in around 10x worse performance.

I've not looked into this/generated code yet, but could there be something interfering with SIMD here?

chriselrod commented 6 months ago

I don't see anything that strikes me as wrong from a quick look.

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 97.79006%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.37%. Comparing base (0952d11) to head (ab511e8). Report is 2 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| src/batch.jl | 97.75% | 2 Missing :warning: |
| src/closure.jl | 97.64% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #134      +/-   ##
==========================================
+ Coverage   91.13%   93.37%   +2.24%
==========================================
  Files           3        3
  Lines         451      574     +123
==========================================
+ Hits          411      536     +125
+ Misses         40       38       -2
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

GianlucaFuwa commented 6 months ago

> I've not looked into this/generated code yet, but could there be something interfering with SIMD here?

I have no clue what could be causing all the extra instructions / the interference with SIMD. I'm running on a Windows machine, so I can't check, but is this a result of the changes I made, or does it appear when using @batch in general?

GianlucaFuwa commented 6 months ago

Nightly fails due to excessive allocations. After checking nightly myself, I found that the number and size of the allocations are rather random, so I'm not sure how to handle them. PProf tells me that the objects being allocated are ThreadStates from ThreadingUtilities.jl.

chriselrod commented 6 months ago

Thanks. I don't think we should bother working around flaws in nightly.