Moelf opened this issue 3 days ago
related to #41963
Actually, if you force the call not to inline, it's faster...
julia> noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
noinline_argmin (generic function with 1 method)
julia> @be rand(Float64, 512000) noinline_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
min 415.286 μs (2 allocs: 48 bytes)
median 420.158 μs (2 allocs: 48 bytes)
mean 423.787 μs (2 allocs: 48 bytes)
max 476.411 μs (2 allocs: 48 bytes)
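For comparison, here is a minimal sketch of the inlined counterpart (inline_argmin is a hypothetical name, and is presumably the same wrapper that slow_argmin refers to later in the thread):

using Chairmarks  # provides the @be macro used throughout this thread

# Copied from above, plus the hypothetical inlined counterpart (same body, no @noinline).
noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
inline_argmin(x) = findmin(x; dims=:)[2]

# Benchmark both wrappers the same way to see the inlining effect directly.
@be rand(Float64, 512000) inline_argmin samples=100 evals=50
@be rand(Float64, 512000) noinline_argmin samples=100 evals=50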
Presumably this is just LLVM being dumb. Do you see the same issue on master?
yes, still on master:
#Version 1.12.0-DEV.1508 (2024-10-28)
julia> @be rand(512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.214 ms
median 2.217 ms
mean 2.225 ms
max 2.258 ms
julia> @be rand(512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 409.705 μs
median 411.422 μs
mean 415.092 μs
max 451.189 μs
FWIW, it doesn't reproduce here (or with LLVM 19, for that matter):
julia> versioninfo()
Julia Version 1.12.0-DEV.1502
Commit ee09ae70d9f (2024-10-26 01:01 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Looks like you're on Zen1, and I can also confirm it doesn't happen on Zen2, but it does happen on Zen4!
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 36 samples with 50 evaluations
min 544.455 μs
median 545.443 μs
mean 545.703 μs
max 554.404 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 34 samples with 50 evaluations
min 564.647 μs
median 566.045 μs
mean 566.817 μs
max 578.258 μs
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 405.372 μs
median 407.367 μs
mean 409.003 μs
max 451.877 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.233 ms
median 2.244 ms
mean 2.266 ms
max 2.389 ms
This really sounds like an upstream bug in LLVM:
$ julia +nightly -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 33 samples with 50 evaluations
min 558.099 μs
median 558.759 μs
mean 558.799 μs
max 559.535 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 7 samples with 50 evaluations
min 3.058 ms
median 3.063 ms
mean 3.063 ms
max 3.067 ms
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 384 × AMD EPYC 9654 96-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
$ julia +nightly -Cx86_64 -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 29 samples with 50 evaluations
min 626.947 μs
median 627.900 μs
mean 627.837 μs
max 628.868 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 28 samples with 50 evaluations
min 697.109 μs
median 698.185 μs
mean 698.274 μs
max 699.674 μs
argmin is over 4x faster when using the generic x86_64 target rather than the native target (znver4).
Maybe it's https://github.com/llvm/llvm-project/issues/91370, i.e. it's generating gather and scatter instructions which are slow.
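One way to check that hypothesis locally is to grep the generated assembly for gather/scatter mnemonics. A sketch, using InteractiveUtils.code_native (note the hot loop may live in a non-inlined callee, in which case it won't show up in argmin's own assembly):

using InteractiveUtils  # code_native

# Capture the native code for argmin(::Vector{Float64}) and search it for
# gather/scatter mnemonics (vgatherqpd, vscatterqpd, ...).
buf = IOBuffer()
code_native(buf, argmin, (Vector{Float64},); debuginfo=:none)
asm = String(take!(buf))
println("gather:  ", occursin("gather", asm))
println("scatter: ", occursin("scatter", asm))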
This is the LLVM module I get on znver4 with Julia nightly (LLVM 18.1), together with the corresponding native code: https://godbolt.org/z/fPMcMEM48 (unfortunately we can't choose the target in godbolt's Julia frontend until #52949 is resolved, but you can get the LLVM IR with https://godbolt.org/z/99W5EG7d4 and then copy-paste it as LLVM IR input). I don't see gather/scatter instructions, but apart from a lone vunpcklpd instruction there are no other packed instructions. If you change the target to znver3 you get more packed instructions, and argmin performance is noticeably better:
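For anyone reproducing this without godbolt, the full module can also be dumped locally with code_llvm and then compiled for different targets. A sketch (the llc invocations assume an LLVM toolchain matching Julia's LLVM version):

using InteractiveUtils  # code_llvm

# Dump the full optimized LLVM module for argmin(::Vector{Float64}) to a file;
# it can then be pasted into an LLVM IR compiler view, or compiled directly.
open("argmin.ll", "w") do io
    code_llvm(io, argmin, (Vector{Float64},); dump_module=true, debuginfo=:none)
end

# Then, outside Julia, compare code generation across targets, e.g.:
#   llc -O3 -mcpu=znver3 argmin.ll -o argmin_znver3.s
#   llc -O3 -mcpu=znver4 argmin.ll -o argmin_znver4.s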
$ julia +nightly -Cznver3 -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 35 samples with 50 evaluations
min 557.571 μs
median 557.980 μs
mean 558.067 μs
max 558.717 μs
matching the native findmin timings.
https://github.com/JuliaLang/julia/blob/2cdfe062952c3a1168da7545a10bfa0ec205b4db/base/reducedim.jl#L1245-L1246
but somehow the [2] is destroying performance:
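For reference, the linked lines define argmin essentially as the one-liner below (a paraphrase, not an exact quote of that commit), so the [2] in question is the tuple indexing that pulls the index out of findmin's (value, index) result:

# Paraphrase of the Base definition linked above (base/reducedim.jl); not an exact quote.
argmin(A::AbstractArray; dims=:) = findmin(A; dims=dims)[2]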