Open matthias314 opened 1 month ago
The code_typed output looks OK, for example:
julia> min_tuple(t) = @fastmath foldl(min, t)
min_tuple (generic function with 1 method)
julia> max_tuple(t) = @fastmath foldl(max, t)
max_tuple (generic function with 1 method)
julia> code_typed(min_tuple, Tuple{Tuple{Vararg{Float64, 4}}})
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = builtin Core.getfield(t, 1)::Float64
│ %2 = builtin Core.getfield(t, 2)::Float64
│ %3 = builtin Core.getfield(t, 3)::Float64
│ %4 = builtin Core.getfield(t, 4)::Float64
│ %5 = intrinsic Base.FastMath.lt_float_fast(%1, %2)::Bool
│ %6 = builtin Core.ifelse(%5, %1, %2)::Float64
│ %7 = intrinsic Base.FastMath.lt_float_fast(%6, %3)::Bool
│ %8 = builtin Core.ifelse(%7, %6, %3)::Float64
│ %9 = intrinsic Base.FastMath.lt_float_fast(%8, %4)::Bool
│ %10 = builtin Core.ifelse(%9, %8, %4)::Float64
└── return %10
) => Float64
julia> code_typed(max_tuple, Tuple{Tuple{Vararg{Float64, 4}}})
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = builtin Core.getfield(t, 1)::Float64
│ %2 = builtin Core.getfield(t, 2)::Float64
│ %3 = builtin Core.getfield(t, 3)::Float64
│ %4 = builtin Core.getfield(t, 4)::Float64
│ %5 = intrinsic Base.FastMath.lt_float_fast(%1, %2)::Bool
│ %6 = builtin Core.ifelse(%5, %2, %1)::Float64
│ %7 = intrinsic Base.FastMath.lt_float_fast(%6, %3)::Bool
│ %8 = builtin Core.ifelse(%7, %3, %6)::Float64
│ %9 = intrinsic Base.FastMath.lt_float_fast(%8, %4)::Bool
│ %10 = builtin Core.ifelse(%9, %4, %8)::Float64
└── return %10
) => Float64
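Reading the two listings side by side, the only difference is the argument order passed to Core.ifelse. As a minimal sketch of that observation in plain Julia (the names min_manual and max_manual are hypothetical helpers for illustration, not definitions taken from Base):

```julia
# Hand-written equivalents of one reduction step in the typed IR above.
# `a` is the running result, `b` is the next tuple element.
min_manual(a, b) = ifelse(a < b, a, b)   # mirrors ifelse(%5, %1, %2)
max_manual(a, b) = ifelse(a < b, b, a)   # mirrors ifelse(%5, %2, %1)

# The @fastmath versions from the thread, for comparison.
min_tuple(t) = @fastmath foldl(min, t)
max_tuple(t) = @fastmath foldl(max, t)

t = (3.0, 1.0, 4.0, 2.0)

# For ordinary (non-NaN) inputs both formulations agree.
@assert foldl(min_manual, t) == min_tuple(t)
@assert foldl(max_manual, t) == max_tuple(t)
```

Since both variants use the same comparison and differ only in which operand is selected, any performance gap would have to arise after this stage.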
I think the way forward is to create a C/C++ reproducer, feed it to Clang, and report an LLVM bug if it reproduces there.
Looks like the LLVM 19 upgrade will fix this: https://godbolt.org/z/sn3nsKzrv.
I don't see an improvement with LLVM 19. Using @Zentrik's llvm-19-actual branch (which is hopefully the right one), I get
julia> @b ntuple(Float64, 30) min_tuple(_)
3.881 ns
julia> @b ntuple(Float64, 30) max_tuple(_)
12.958 ns
as before. The output of @code_llvm is also as before.
Julia Version 1.12.0-DEV.1552
Commit d78f156a25 (2024-11-03 11:24 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
WORD_SIZE: 64
LLVM: libLLVM-19.1.1 (ORCJIT, skylake)
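For anyone who wants to repeat the IR comparison on their own build, here is a rough sketch that dumps the optimized LLVM IR of both functions to strings and counts a few instruction patterns. The helper llvm_ir and the choice of patterns are assumptions made for illustration; the counts are only a crude proxy for whether the reduction was vectorized, and will differ between Julia/LLVM versions and CPU targets.

```julia
using InteractiveUtils   # provides code_llvm outside the REPL

min_tuple(t) = @fastmath foldl(min, t)
max_tuple(t) = @fastmath foldl(max, t)

# Hypothetical helper: optimized LLVM IR for f applied to an n-tuple of Float64.
llvm_ir(f, n) = sprint() do io
    code_llvm(io, f, Tuple{NTuple{n, Float64}}; debuginfo = :none)
end

for (name, f) in (("min_tuple", min_tuple), ("max_tuple", max_tuple))
    ir = llvm_ir(f, 30)
    n_fcmp   = count(r"fcmp", ir)              # floating-point comparisons
    n_select = count(r"select", ir)            # selects produced from ifelse
    n_vector = count(r"<\d+ x double>", ir)    # vector-typed values
    println(name, ": fcmp=", n_fcmp, " select=", n_select, " vector types=", n_vector)
end
```

The timings quoted above use the @b macro (from Chairmarks.jl); this sketch deliberately sticks to the standard library and looks only at the IR.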
On the other hand, min_fast looks fine: For
I get
but
This is already at the IR level, so I assume this is not just for a single processor type.
Using max_fast indeed leads to a slowdown for longer tuples: