Moelf opened this issue 3 days ago
related to #41963
Actually, if you force the call not to inline, it's faster...
julia> noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
noinline_argmin (generic function with 1 method)
julia> @be rand(Float64, 512000) noinline_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
min 415.286 μs (2 allocs: 48 bytes)
median 420.158 μs (2 allocs: 48 bytes)
mean 423.787 μs (2 allocs: 48 bytes)
max 476.411 μs (2 allocs: 48 bytes)
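For comparison, here is a minimal sketch of the inlined counterpart (inline_argmin is a hypothetical name, and is presumably the same wrapper that slow_argmin refers to later in the thread):

using Chairmarks  # provides the @be macro used throughout this thread

# Copied from above, plus the hypothetical inlined counterpart (same body, no @noinline).
noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
inline_argmin(x) = findmin(x; dims=:)[2]

# Benchmark both wrappers the same way to see the inlining effect directly.
@be rand(Float64, 512000) inline_argmin samples=100 evals=50
@be rand(Float64, 512000) noinline_argmin samples=100 evals=50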
Presumably this is just LLVM being dumb. Do you see the same issue on master?
yes, still on master:
#Version 1.12.0-DEV.1508 (2024-10-28)
julia> @be rand(512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.214 ms
median 2.217 ms
mean 2.225 ms
max 2.258 ms
julia> @be rand(512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 409.705 μs
median 411.422 μs
mean 415.092 μs
max 451.189 μs
FWIW, it doesn't reproduce here (or with LLVM 19, for that matter):
julia> versioninfo()
Julia Version 1.12.0-DEV.1502
Commit ee09ae70d9f (2024-10-26 01:01 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Looks like you're on Zen1, and I can also confirm it doesn't happen on Zen2, but it does happen on Zen4!
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 36 samples with 50 evaluations
min 544.455 μs
median 545.443 μs
mean 545.703 μs
max 554.404 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 34 samples with 50 evaluations
min 564.647 μs
median 566.045 μs
mean 566.817 μs
max 578.258 μs
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 405.372 μs
median 407.367 μs
mean 409.003 μs
max 451.877 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.233 ms
median 2.244 ms
mean 2.266 ms
max 2.389 ms
This really sounds like an upstream bug in LLVM:
$ julia +nightly -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 33 samples with 50 evaluations
min 558.099 μs
median 558.759 μs
mean 558.799 μs
max 559.535 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 7 samples with 50 evaluations
min 3.058 ms
median 3.063 ms
mean 3.063 ms
max 3.067 ms
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 384 × AMD EPYC 9654 96-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
$ julia +nightly -Cx86_64 -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 29 samples with 50 evaluations
min 626.947 μs
median 627.900 μs
mean 627.837 μs
max 628.868 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 28 samples with 50 evaluations
min 697.109 μs
median 698.185 μs
mean 698.274 μs
max 699.674 μs
argmin is over 4x faster when using the generic x86_64 target rather than the native target (znver4).
Maybe it's https://github.com/llvm/llvm-project/issues/91370, i.e. it's generating gather and scatter instructions which are slow.
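One way to check that hypothesis locally is to grep the generated assembly for gather/scatter mnemonics. A sketch, using InteractiveUtils.code_native (note the hot loop may live in a non-inlined callee, in which case it won't show up in argmin's own assembly):

using InteractiveUtils  # code_native

# Capture the native code for argmin(::Vector{Float64}) and search it for
# gather/scatter mnemonics (vgatherqpd, vscatterqpd, ...).
buf = IOBuffer()
code_native(buf, argmin, (Vector{Float64},); debuginfo=:none)
asm = String(take!(buf))
println("gather:  ", occursin("gather", asm))
println("scatter: ", occursin("scatter", asm))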
This is the LLVM module I get on znver4 with Julia nightly (LLVM 18.1), together with the corresponding native code: https://godbolt.org/z/fPMcMEM48 (unfortunately we can't choose the target in godbolt's Julia frontend until #52949 is resolved, but you can get the LLVM IR with https://godbolt.org/z/99W5EG7d4 and then copy-paste it as LLVM IR input). I don't see gather/scatter instructions, but apart from a lone vunpcklpd instruction there are no other packed instructions. If you change the target to znver3 you get more packed instructions, and argmin performance is noticeably better:
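For anyone reproducing this without godbolt, the full module can also be dumped locally with code_llvm and then compiled for different targets. A sketch (the llc invocations assume an LLVM toolchain matching Julia's LLVM version):

using InteractiveUtils  # code_llvm

# Dump the full optimized LLVM module for argmin(::Vector{Float64}) to a file;
# it can then be pasted into an LLVM IR compiler view, or compiled directly.
open("argmin.ll", "w") do io
    code_llvm(io, argmin, (Vector{Float64},); dump_module=true, debuginfo=:none)
end

# Then, outside Julia, compare code generation across targets, e.g.:
#   llc -O3 -mcpu=znver3 argmin.ll -o argmin_znver3.s
#   llc -O3 -mcpu=znver4 argmin.ll -o argmin_znver4.s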
$ julia +nightly -Cznver3 -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 35 samples with 50 evaluations
min 557.571 μs
median 557.980 μs
mean 558.067 μs
max 558.717 μs
matching the native findmin timings.
https://github.com/JuliaLang/julia/blob/2cdfe062952c3a1168da7545a10bfa0ec205b4db/base/reducedim.jl#L1245-L1246
but somehow the [2] is destroying performance:
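For reference, the linked lines define argmin essentially as the one-liner below (a paraphrase, not an exact quote of that commit), so the [2] in question is the tuple indexing that pulls the index out of findmin's (value, index) result:

# Paraphrase of the Base definition linked above (base/reducedim.jl); not an exact quote.
argmin(A::AbstractArray; dims=:) = findmin(A; dims=dims)[2]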