Closed · GianlucaFuwa closed this 6 months ago
It might be worthwhile to update the benchmarks involving `threadlocal` in one of the README subsections, to showcase that things no longer allocate.
> It might be worthwhile to update the benchmarks involving `threadlocal` in one of the README subsections, to showcase that things no longer allocate.
I might be misunderstanding, but I can't find any benchmarks using `threadlocal` that show allocations in the README. I could add something like:
```julia
julia> function batch_reduction()
           y1 = 0
           y2 = 1
           @batch reduction=((+, y1), (*, y2)) for i in 1:9
               y1 += i
               y2 *= i
           end
           y1, y2
       end

julia> batch_reduction()
(45, 362880)

julia> @allocated batch_reduction()
0
```
to show that `reduction` doesn't allocate, though.
I also forgot to show how this PR affects the performance of `threadlocal`:
```julia
using Test, BenchmarkTools

function f()
    @batch threadlocal=rand(10:99)::Int for i in 0:9
        threadlocal += 1
    end
    sum(threadlocal)
end

try
    @inferred f()
catch e
    println("Not inferred")
end

@benchmark f()
```
Old:
```
Not inferred
BenchmarkTools.Trial: 874 samples with 170 evaluations.
 Range (min … max):  785.882 ns … 188.296 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.131 μs                ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.600 μs ±  25.621 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▃▄▆▇█▇▄▅▂▂▂▂▂ ▁ ▂                                            ▄
  ████████████████▇██▇█▅▅▆███▇▇▆█▆▇▆▆▇▆▄▄▅▅▄▄▅▃▃▄▃▃▃▃▁▃▃▂▂▁▃▁▁▃ ▅
  786 ns        Histogram: frequency by time                 108 μs <

 Memory estimate: 224 bytes, allocs estimate: 5.
```
New:
```
BenchmarkTools.Trial: 10000 samples with 287 evaluations.
 Range (min … max):  282.927 ns … 201.847 μs  ┊ GC (min … max): 0.00% … 2.28%
 Time  (median):     303.833 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   644.395 ns ±   4.433 μs  ┊ GC (mean ± σ):  0.33% ± 0.05%

  █▂▂▁                                                          ▁
  █████▆▆▇█▆▆▆▅▃▅▁▄▄▄▆▄▆▆▅▆▆▆▅▄▃▄▄▃▅▄▄▆▄▄▄▄▅▄▅▃▁▃▁▄▃▁▃▃▃▄▁▁▄▁▄▄ █
  283 ns        Histogram: log(frequency) by time              7.69 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.
```
My apologies, the README has been simplified significantly since the last time I looked. Disregard my comment.
Performance seems to be worse than I'd have expected.
```julia
julia> using LoopVectorization, BenchmarkTools, Polyester, LinearAlgebra

julia> function dot_batch(x, y)
           s = zero(Base.promote_eltype(x, y))
           @batch reduction=((+, s),) for i = eachindex(x, y)
               @inbounds @fastmath s += x[i] * y[i]
           end
           s
       end
dot_batch (generic function with 1 method)

julia> function dot_tturbo(x, y)
           s = zero(Base.promote_eltype(x, y))
           @tturbo for i = eachindex(x, y)
               s += x[i] * y[i]
           end
           s
       end
dot_tturbo (generic function with 1 method)

julia> function dot_turbo(x, y)
           s = zero(Base.promote_eltype(x, y))
           @turbo for i = eachindex(x, y)
               s += x[i] * y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(16384*8);

julia> @btime dot_turbo($x, $x)
  8.710 μs (0 allocations: 0 bytes)
43620.94200326388

julia> @btime dot_tturbo($x, $x)
  3.097 μs (0 allocations: 0 bytes)
43620.94200326386

julia> @btime dot($x, $x)
  3.174 μs (0 allocations: 0 bytes)
43620.942003263866

julia> @btime dot_batch($x, $x)
  32.269 μs (0 allocations: 0 bytes)
43620.94200326387

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} =
           foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(dot_batch, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.01e+11  100.0%  #   3.5 cycles per ns
└ task-clock               1.42e+11  100.0%  # 141.6 s
┌ instructions             5.06e+11  100.0%  #   1.0 insns per cycle
│ branch-instructions      1.00e+11  100.0%  #  19.8% of insns
└ branch-misses            8.47e+06  100.0%  #   0.0% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  #   8.0% of dcache loads
│ L1-dcache-loads          2.06e+11  100.0%
│ cache-misses             5.86e+05  100.0%  #   2.1% of cache refs
└ cache-references         2.86e+07  100.0%
                 aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(dot_tturbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.92e+10  100.0%  #   3.6 cycles per ns
└ task-clock               1.08e+10  100.0%  #  10.8 s
┌ instructions             4.57e+10  100.0%  #   1.2 insns per cycle
│ branch-instructions      4.19e+09  100.0%  #   9.2% of insns
└ branch-misses            5.73e+06  100.0%  #   0.1% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  #  50.0% of dcache loads
│ L1-dcache-loads          3.29e+10  100.0%
│ cache-misses             5.42e+04  100.0%  #   0.3% of cache refs
└ cache-references         1.76e+07  100.0%
                 aggregated from 3 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(dot, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.04e+10  100.0%  #   3.4 cycles per ns
└ task-clock               1.48e+10  100.0%  #  14.8 s
┌ instructions             6.40e+10  100.0%  #   1.3 insns per cycle
│ branch-instructions      6.45e+09  100.0%  #  10.1% of insns
└ branch-misses            8.16e+06  100.0%  #   0.1% of branch insns
┌ L1-dcache-load-misses    1.69e+10  100.0%  #  46.5% of dcache loads
│ L1-dcache-loads          3.64e+10  100.0%
│ cache-misses             2.80e+05  100.0%  #   0.5% of cache refs
└ cache-references         6.21e+07  100.0%
                 aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(dot_turbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.05e+10  100.0%  #   4.1 cycles per ns
└ task-clock               9.81e+09  100.0%  #   9.8 s
┌ instructions             3.90e+10  100.0%  #   1.0 insns per cycle
│ branch-instructions      2.06e+09  100.0%  #   5.3% of insns
└ branch-misses            1.18e+06  100.0%  #   0.1% of branch insns
┌ L1-dcache-load-misses    1.64e+10  100.0%  #  50.0% of dcache loads
│ L1-dcache-loads          3.28e+10  100.0%
│ cache-misses             3.07e+05  100.0%  #   0.0% of cache refs
└ cache-references         1.86e+09  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
It requires around 10x as many instructions as `@tturbo` for this example, resulting in around 10x worse performance.
I've not looked into this or the generated code yet, but could there be something interfering with SIMD here?
I don't see anything that stands out to me as wrong from a quick look.
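One way to narrow down whether SIMD is being lost (a sketch, not from the thread; `dot_serial` is a name introduced here) is to write a single-threaded baseline with the same loop body and inspect its LLVM IR for vector instructions:

```julia
using InteractiveUtils  # provides @code_llvm

# Single-threaded baseline with the same loop body as dot_batch.
# If this vectorizes but the @batch version is still ~10x slower than
# @tturbo, the overhead is likely in the threaded reduction machinery
# rather than in the loop body itself.
function dot_serial(x, y)
    s = zero(Base.promote_eltype(x, y))
    @inbounds @fastmath for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

x = rand(16384 * 8);
dot_serial(x, x)  # sanity check against the threaded results

# Vector types like <4 x double> in the IR indicate the loop vectorized;
# their absence suggests something is blocking SIMD.
@code_llvm debuginfo = :none dot_serial(x, x)
```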
Attention: Patch coverage is 97.79006%, with 4 lines in your changes missing coverage. Please review.
Project coverage is 93.37%. Comparing base (0952d11) to head (ab511e8). Report is 2 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| src/batch.jl | 97.75% | 2 Missing :warning: |
| src/closure.jl | 97.64% | 2 Missing :warning: |
> I've not looked into this or the generated code yet, but could there be something interfering with SIMD here?
I have no clue what could be causing all the extra instructions / interference with SIMD. I'm running on a Windows machine so I can't check, but is this a result of the changes I made, or does it appear when using `@batch` in general?
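One way to separate the two possibilities (a sketch under assumptions: `dot_batch_buffered` is a hypothetical helper, not from this PR) would be to time a plain `@batch` loop that accumulates into a preallocated per-chunk buffer, exercising `@batch` without the new `reduction` path:

```julia
using Polyester

# Plain @batch without `reduction`: each chunk accumulates into its own
# slot of a preallocated buffer, then the partials are summed serially.
# If this is also ~10x slower than @tturbo, the overhead predates this PR;
# if only the `reduction` path is slow, the new code is the likely culprit.
function dot_batch_buffered(x, y; nchunks = Threads.nthreads())
    partials = zeros(Base.promote_eltype(x, y), nchunks)
    len = cld(length(x), nchunks)
    @batch for c in 1:nchunks
        lo = (c - 1) * len + 1
        hi = min(c * len, length(x))
        acc = zero(eltype(partials))
        @inbounds @fastmath for i in lo:hi
            acc += x[i] * y[i]
        end
        partials[c] = acc
    end
    sum(partials)
end

x = rand(16384 * 8);
dot_batch_buffered(x, x)  # ≈ dot(x, x)
```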
Nightly fails due to excessive allocations. After checking nightly myself, I found that the number and size of allocations is rather random, so I'm not sure how to handle them.
PProf tells me that the objects being allocated are `ThreadState`s from ThreadingUtilities.jl.
Thanks. I don't think we should bother working around flaws in nightly.
Fixes https://github.com/JuliaSIMD/Polyester.jl/issues/112 by passing the threadlocal accumulator as an argument to the batchcall instead of capturing it.
I found that adding a more restrictive, OpenMP-like `reduction` flag was more elegant than checking whether `threadlocal` was `isbits` to eliminate allocations in such cases; using it looks like the `batch_reduction` example earlier in this thread. It works by allocating the initial value (the neutral element of the operation) within each thread's function and loading + reducing the partial results at the end. This PR also includes the `tid` fix from https://github.com/JuliaSIMD/Polyester.jl/pull/131/commits/faab451e419553294ccab1562eafa519c90ab256, fixes in the documentation, and additional tests for the `reduction` functionality.
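The per-thread mechanism described above can be sketched in plain Julia (a hand-rolled illustration using `Threads.@spawn`; `reduce_chunks` is a name invented here, not Polyester's actual implementation):

```julia
# OpenMP-style reduction sketch: each task starts from the operation's
# neutral element, accumulates its contiguous chunk, and the per-task
# partials are loaded and reduced at the end. Illustrative only.
function reduce_chunks(op, init, xs; ntasks = Threads.nthreads())
    n = length(xs)
    ntasks = min(ntasks, n)
    # Split 1:n into ntasks contiguous ranges via ntasks+1 boundary points.
    bounds = round.(Int, range(0, n; length = ntasks + 1))
    tasks = map(1:ntasks) do t
        Threads.@spawn begin
            acc = init  # neutral element, e.g. 0 for +, 1 for *
            for i in (bounds[t] + 1):bounds[t + 1]
                acc = op(acc, xs[i])
            end
            acc
        end
    end
    # Load each task's partial result and combine them into the final value.
    foldl(op, fetch.(tasks); init = init)
end

reduce_chunks(+, 0, 1:9)  # 45, matching the batch_reduction example
```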