Moelf opened this issue 1 day ago
On my machine, the version without `@turbo` is actually faster...

## with `@turbo`
```
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30 data/events-pp-13TeV-20GeV.hepmc3
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         38.7031
```
## without `@turbo`
```
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30 data/events-pp-13TeV-20GeV.hepmc3
Precompiling JetReconstruction...
  1 dependency successfully precompiled in 2 seconds. 117 already precompiled.
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         29.7108
```
I do see a difference in a micro-benchmark:
```julia
julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> function basic_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end
```
```julia
julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 24004 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.675 ns
 max    681.000 ns

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 20405 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.634 ns
 max    1.442 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 7178 samples with 1 evaluation
 min    40.000 ns
 median 240.000 ns
 mean   255.817 ns
 max    782.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25737 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   260.301 ns
 max    14.146 μs
```
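For anyone replaying these numbers, a quick sanity check may be useful (my own addition, not from the thread): the hand-rolled loop should agree with `Base.findmin` over the active prefix `dij[1:n]`.

```julia
# Sanity check (illustrative, not part of the benchmark): the manual
# single-pass loop must match Base.findmin on the active prefix dij[1:n].
# With a strict `<` comparison, the first occurrence of the minimum wins,
# exactly as findmin does.
function basic_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    for here in 2:n
        newmin = dij[here] < dij_min
        best = newmin ? here : best
        dij_min = newmin ? dij[here] : dij_min
    end
    dij_min, best
end

dij = abs.(rand(300))
n = 250
@assert basic_findmin(dij, n) == findmin(@view dij[1:n])
```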
Hi @Moelf - yes, this has been on my TODO as well, as we might lose LoopVectorization. My micro-benchmark/test case is here.
Just running that now on my M2 Pro I get the following:
`findmin`:

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  171.583 μs …   6.069 ms  ┊ GC (min … max): 0.00% … 96.63%
 Time  (median):     176.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.996 μs ± 102.615 μs  ┊ GC (mean ± σ):  1.88% ± 3.33%

   ▄█▂ ▂▆▄▆
  ▇███▄▄█████▆▆▆▇▆▅▅▅▄▄▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  172 μs          Histogram: frequency by time          203 μs <

 Memory estimate: 70.34 KiB, allocs estimate: 2001.
```
`findmin` with `@turbo`:

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.333 μs …  5.323 ms  ┊ GC (min … max): 0.00% … 98.67%
 Time  (median):     48.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.801 μs ± 64.514 μs  ┊ GC (mean ± σ):  2.06% ± 1.69%

  ▁▆██▄▃▁▃▅▅▃▂▁▁▂▄▅▅▃▂▂▂▃▂▁ ▂
  ███████████████████████████▇█▇███▇█████▇▇█▇█▇▇▆▆▆▆▅▇▆▆▅▅▅▅▅ █
  47.3 μs      Histogram: log(frequency) by time      60.9 μs <

 Memory estimate: 31.28 KiB, allocs estimate: 1001.
```
Note that this test shrinks the array by one step each time, which correctly mocks up the jet reconstruction case; however, I see much the same result for a test against a fixed array size.
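The shrinking-array pattern described above can be mocked up roughly as follows (a sketch under my own assumptions; `sweep!` and the swap-delete compaction are illustrative, not the actual benchmark code):

```julia
# Illustrative sketch: each jet-reconstruction step removes one candidate,
# so the active prefix length n shrinks by one per iteration while the
# underlying array stays the same size.
function basic_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    for here in 2:n
        newmin = dij[here] < dij_min
        best = newmin ? here : best
        dij_min = newmin ? dij[here] : dij_min
    end
    dij_min, best
end

function sweep!(dij)
    s = 0.0
    n = length(dij)
    while n > 1
        x, i = basic_findmin(dij, n)
        s += x
        dij[i] = dij[n]  # swap-delete: last active element fills the hole
        n -= 1
    end
    s
end
```

Running `sweep!(abs.(rand(400)))` exercises 399 `findmin` calls over a steadily shrinking prefix, which is closer to the reconstruction workload than repeated calls at a fixed size.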
Yeah, I can reproduce that. I also opened: https://discourse.julialang.org/t/faster-findmin-without-loopvectorization-jl/121742/3
I saw - thanks. I will pitch in there later.
```julia
function naive_findmin(dij, n)
    # fold over only the active prefix dij[1:n]
    x = @fastmath foldl(min, @view dij[1:n])
    i = findfirst(==(x), @view dij[1:n])::Int
    x, i
end
```
looks quite promising
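Another LV-free direction worth trying (my own sketch, not from the thread): keep the single-pass loop but replace the ternaries with `ifelse`, so both updates become branchless selects that the compiler can often vectorize under a plain `@simd` hint, with no `@turbo` needed.

```julia
# Sketch: branchless single-pass findmin with ifelse and @simd, assuming
# no LoopVectorization. @simd may reassociate the reduction, so with exact
# ties the reported index could differ from findmin's first-occurrence
# rule; for distinct values the result matches Base.findmin on dij[1:n].
function simd_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    @inbounds @simd for here in 2:n
        newmin = dij[here] < dij_min
        best = ifelse(newmin, here, best)
        dij_min = ifelse(newmin, dij[here], dij_min)
    end
    dij_min, best
end
```

Whether this actually matches `@turbo` throughput would need checking with the `@be` harness above; `ifelse` evaluates both arms eagerly, which is exactly what makes the loop body branch-free.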
Since LV.jl is going away in 1.12, maybe we should attempt to remove it.
I'm looking at https://github.com/graeme-a-stewart/JetReconstructionBenchmarks.jl/ and want to find a small set of benchmarks I can run to monitor performance when making changes. @graeme-a-stewart, can you point out a minimal set of things to run using that repo? It's not completely clear what to run (e.g. which example snippets).