JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.88k stars 5.49k forks source link

`max_fast` doesn't vectorize for tuples #56341

Open matthias314 opened 1 month ago

matthias314 commented 1 month ago

On the other hand, min_fast looks fine: For

min_tuple(t) = @fastmath foldl(min, t)
max_tuple(t) = @fastmath foldl(max, t)
t = ntuple(Float64, 8)

I get

julia> @code_llvm debuginfo=:none min_tuple(t)
; Function Signature: min_tuple(NTuple{8, Float64})
define double @julia_min_tuple_10628(ptr nocapture noundef nonnull readonly align 8 dereferenceable(64) %"t::Tuple") #0 {
top:
  %0 = load <8 x double>, ptr %"t::Tuple", align 8
  %1 = call fast double @llvm.vector.reduce.fmin.v8f64(<8 x double> %0)
  ret double %1
}

but

julia> @code_llvm debuginfo=:none max_tuple(t)
; Function Signature: max_tuple(NTuple{8, Float64})
define double @julia_max_tuple_10646(ptr nocapture noundef nonnull readonly align 8 dereferenceable(64) %"t::Tuple") #0 {
top:
  %"t::Tuple[2]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 1
  %"t::Tuple[3]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 2
  %"t::Tuple[4]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 3
  %"t::Tuple[5]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 4
  %"t::Tuple[6]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 5
  %"t::Tuple[7]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 6
  %"t::Tuple[8]_ptr" = getelementptr inbounds [8 x double], ptr %"t::Tuple", i64 0, i64 7
  %"t::Tuple[1]_ptr.unbox" = load double, ptr %"t::Tuple", align 8
  %"t::Tuple[2]_ptr.unbox" = load double, ptr %"t::Tuple[2]_ptr", align 8
  %.inv = fcmp fast olt double %"t::Tuple[1]_ptr.unbox", %"t::Tuple[2]_ptr.unbox"
  %0 = select fast i1 %.inv, double %"t::Tuple[2]_ptr.unbox", double %"t::Tuple[1]_ptr.unbox"
  %"t::Tuple[3]_ptr.unbox" = load double, ptr %"t::Tuple[3]_ptr", align 8
  %.inv34 = fcmp fast olt double %0, %"t::Tuple[3]_ptr.unbox"
  %1 = select fast i1 %.inv34, double %"t::Tuple[3]_ptr.unbox", double %0
  %"t::Tuple[4]_ptr.unbox" = load double, ptr %"t::Tuple[4]_ptr", align 8
  %.inv35 = fcmp fast olt double %1, %"t::Tuple[4]_ptr.unbox"
  %2 = select fast i1 %.inv35, double %"t::Tuple[4]_ptr.unbox", double %1
  %"t::Tuple[5]_ptr.unbox" = load double, ptr %"t::Tuple[5]_ptr", align 8
  %.inv36 = fcmp fast olt double %2, %"t::Tuple[5]_ptr.unbox"
  %3 = select fast i1 %.inv36, double %"t::Tuple[5]_ptr.unbox", double %2
  %"t::Tuple[6]_ptr.unbox" = load double, ptr %"t::Tuple[6]_ptr", align 8
  %.inv37 = fcmp fast olt double %3, %"t::Tuple[6]_ptr.unbox"
  %4 = select fast i1 %.inv37, double %"t::Tuple[6]_ptr.unbox", double %3
  %"t::Tuple[7]_ptr.unbox" = load double, ptr %"t::Tuple[7]_ptr", align 8
  %.inv38 = fcmp fast olt double %4, %"t::Tuple[7]_ptr.unbox"
  %5 = select fast i1 %.inv38, double %"t::Tuple[7]_ptr.unbox", double %4
  %"t::Tuple[8]_ptr.unbox" = load double, ptr %"t::Tuple[8]_ptr", align 8
  %.inv39 = fcmp fast olt double %5, %"t::Tuple[8]_ptr.unbox"
  %6 = select fast i1 %.inv39, double %"t::Tuple[8]_ptr.unbox", double %5
  ret double %6
}

This is already at the IR level, so I assume this is not just for a single processor type.

Using max_fast indeed leads to a slowdown for longer tuples:

julia> @b ntuple(Float64, 30) min_tuple(_)
4.076 ns

julia> @b ntuple(Float64, 30) max_tuple(_)
13.283 ns
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
nsajko commented 1 month ago

The code_typed looks OK, example:

julia> min_tuple(t) = @fastmath foldl(min, t)
min_tuple (generic function with 1 method)

julia> max_tuple(t) = @fastmath foldl(max, t)
max_tuple (generic function with 1 method)

julia> code_typed(min_tuple, Tuple{Tuple{Vararg{Float64, 4}}})
1-element Vector{Any}:
 CodeInfo(
1 ─ %1  =   builtin Core.getfield(t, 1)::Float64
│   %2  =   builtin Core.getfield(t, 2)::Float64
│   %3  =   builtin Core.getfield(t, 3)::Float64
│   %4  =   builtin Core.getfield(t, 4)::Float64
│   %5  = intrinsic Base.FastMath.lt_float_fast(%1, %2)::Bool
│   %6  =   builtin Core.ifelse(%5, %1, %2)::Float64
│   %7  = intrinsic Base.FastMath.lt_float_fast(%6, %3)::Bool
│   %8  =   builtin Core.ifelse(%7, %6, %3)::Float64
│   %9  = intrinsic Base.FastMath.lt_float_fast(%8, %4)::Bool
│   %10 =   builtin Core.ifelse(%9, %8, %4)::Float64
└──       return %10
) => Float64

julia> code_typed(max_tuple, Tuple{Tuple{Vararg{Float64, 4}}})
1-element Vector{Any}:
 CodeInfo(
1 ─ %1  =   builtin Core.getfield(t, 1)::Float64
│   %2  =   builtin Core.getfield(t, 2)::Float64
│   %3  =   builtin Core.getfield(t, 3)::Float64
│   %4  =   builtin Core.getfield(t, 4)::Float64
│   %5  = intrinsic Base.FastMath.lt_float_fast(%1, %2)::Bool
│   %6  =   builtin Core.ifelse(%5, %2, %1)::Float64
│   %7  = intrinsic Base.FastMath.lt_float_fast(%6, %3)::Bool
│   %8  =   builtin Core.ifelse(%7, %3, %6)::Float64
│   %9  = intrinsic Base.FastMath.lt_float_fast(%8, %4)::Bool
│   %10 =   builtin Core.ifelse(%9, %4, %8)::Float64
└──       return %10
) => Float64

I think the way forward may perhaps be to try creating a C/C++ reproducer, feeding it to Clang, and then reporting an LLVM bug if it reproduces for Clang.

Zentrik commented 1 month ago

Looks like the LLVM 19 upgrade will fix this, https://godbolt.org/z/sn3nsKzrv.

matthias314 commented 1 week ago

I don't see an improvement with LLVM 19. Using @Zentrik's llvm-19-actual branch (which is hopefully the right one), I get

julia> @b ntuple(Float64, 30) min_tuple(_)
3.881 ns

julia> @b ntuple(Float64, 30) max_tuple(_)
12.958 ns

as before. The output of @code_llvm is also as before.

Julia Version 1.12.0-DEV.1552
Commit d78f156a25 (2024-11-03 11:24 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.1 (ORCJIT, skylake)