JuliaHEP / JetReconstruction.jl

Jet reconstruction (reclustering) with Julia
https://juliahep.github.io/JetReconstruction.jl/
MIT License

Test removal of LoopVectorization.jl #83

Open Moelf opened 1 day ago

Moelf commented 1 day ago

Since LV.jl is going away in 1.12, maybe we should attempt to remove it.

I'm looking at https://github.com/graeme-a-stewart/JetReconstructionBenchmarks.jl/ and want to find a small set of benchmarks I can run to monitor the performance when making changes, @graeme-a-stewart can you help to point out a minimal set of things to run, using that repo? It's not completely clear what to run (e.g. example snippets)

Moelf commented 1 day ago

On my machine, the version without `@turbo` is faster...

## with `@turbo`
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         38.7031

## without `@turbo`
akako@frame16 ~/D/g/JetReconstructionBenchmarks.jl (main)> julia --project src/benchmark.jl --backend Julia --repeats 30  data/events-pp-13TeV-20GeV.hepmc3
Precompiling JetReconstruction...
  1 dependency successfully precompiled in 2 seconds. 117 already precompiled.
1×5 DataFrame
 Row │ File_path                          File                          mean_particles  n_samples  time_per_event
     │ Any                                String                        Int64           Int64      Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ data/events-pp-13TeV-20GeV.hepmc3  events-pp-13TeV-20GeV.hepmc3              -1         16         29.7108

Moelf commented 1 day ago

I do see a difference in a micro benchmark:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> function basic_findmin(dij, n)
           # findmin(@inbounds @view dij[1:n])
           best = 1
           @inbounds dij_min = dij[1]
           for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end
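A possible middle ground (not from the thread, just a sketch): ask Julia's auto-vectorizer for help with `@simd` and branchless `ifelse`, instead of LoopVectorization's `@turbo`. Whether this actually vectorizes the two-accumulator reduction is machine- and Julia-version-dependent, so it would need benchmarking like the variants above. The name `simd_findmin` is hypothetical.

```julia
# Hypothetical @turbo-free variant: same loop body as fast_findmin, but
# using @simd and ifelse so the compiler may vectorize it on its own.
function simd_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    @inbounds @simd for here in 2:n
        newmin = dij[here] < dij_min
        best = ifelse(newmin, here, best)      # branchless select
        dij_min = ifelse(newmin, dij[here], dij_min)
    end
    dij_min, best
end
```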

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 24004 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.675 ns
 max    681.000 ns

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 20405 samples with 1 evaluation
 min    50.000 ns
 median 90.000 ns
 mean   92.634 ns
 max    1.442 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 7178 samples with 1 evaluation
 min    40.000 ns
 median 240.000 ns
 mean   255.817 ns
 max    782.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25737 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   260.301 ns
 max    14.146 μs

graeme-a-stewart commented 1 day ago

Hi @Moelf - yes, this has been on my TODO as well, as we might lose LoopVectorization. My micro-benchmark/testcase is here.

Just running that now on my M2 Pro I get the following:

Julia findmin

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  171.583 μs …   6.069 ms  ┊ GC (min … max): 0.00% … 96.63%
 Time  (median):     176.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.996 μs ± 102.615 μs  ┊ GC (mean ± σ):  1.88% ±  3.33%

   ▄█▂  ▂▆▄▆                                                     
  ▇███▄▄█████▆▆▆▇▆▅▅▅▄▄▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  172 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 70.34 KiB, allocs estimate: 2001.

Fast findmin with @turbo

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.333 μs …  5.323 ms  ┊ GC (min … max): 0.00% … 98.67%
 Time  (median):     48.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.801 μs ± 64.514 μs  ┊ GC (mean ± σ):  2.06% ±  1.69%

  ▁▆██▄▃▁▃▅▅▃▂▁▁▂▄▅▅▃▂▂▂▃▂▁                                   ▂
  ███████████████████████████▇█▇███▇█████▇▇█▇█▇▇▆▆▆▆▅▇▆▆▅▅▅▅▅ █
  47.3 μs      Histogram: log(frequency) by time      60.9 μs <

 Memory estimate: 31.28 KiB, allocs estimate: 1001.

Note that this test shrinks the array by 1 step each time, which correctly mocks up the jet reconstruction case - however, I see much the same result for a test against a fixed array size.
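For reference, a minimal sketch of that shrinking-array pattern (hypothetical names, assuming the scalar `basic_findmin` from earlier in the thread): find the minimum, swap-remove the winner, and repeat with `n` one smaller each iteration, mimicking how the active jet list shrinks during reconstruction.

```julia
# Scalar findmin from earlier in the thread, repeated so this is self-contained.
function basic_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    for here in 2:n
        newmin = dij[here] < dij_min
        best = newmin ? here : best
        dij_min = newmin ? dij[here] : dij_min
    end
    dij_min, best
end

# Hypothetical shrinking-array driver: scan, swap-remove, shrink, repeat.
function shrink_scan(dij)
    total = 0.0
    n = length(dij)
    while n > 1
        dij_min, best = basic_findmin(dij, n)
        total += dij_min
        dij[best] = dij[n]   # swap-remove the current minimum
        n -= 1
    end
    total
end
```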

Moelf commented 1 day ago

yeah I can reproduce that, I also opened: https://discourse.julialang.org/t/faster-findmin-without-loopvectorization-jl/121742/3

graeme-a-stewart commented 1 day ago

I saw - thanks. I will pitch in there later.

Moelf commented 23 hours ago

       function naive_findmin(dij, n)
           x = @fastmath foldl(min, dij)
           i = findfirst(==(x), dij)::Int
           x, i
       end

looks quite promising
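One caveat worth checking for this two-pass approach: it relies on `@fastmath foldl(min, ...)` returning a value that is exactly equal to some array element, and on `findfirst` picking the same (first) index that `findmin` would. A quick sanity sketch against `Base.findmin` on finite inputs (NaN handling under `@fastmath` is not guaranteed, so NaNs are excluded here):

```julia
# The candidate from the thread, verbatim.
function naive_findmin(dij, n)
    x = @fastmath foldl(min, dij)
    i = findfirst(==(x), dij)::Int
    x, i
end

# Sanity check (not from the thread): agree with Base.findmin on finite data.
for _ in 1:100
    v = abs.(randn(rand(2:64)))
    @assert naive_findmin(v, length(v)) == findmin(v)
end
```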