JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`@simd` regression on Julia 1.10-beta3 when using `--check-bounds=yes` #51814

Open chriselrod opened 1 year ago

chriselrod commented 1 year ago

Julia 1.10-beta3:

julia> function mydot(x,y)
           acc = zero(eltype(x))
           @simd for i = eachindex(x,y)
                acc += x[i]*y[i]
           end
           acc
       end
mydot (generic function with 1 method)

julia> function mydot_fm(x,y)
           acc = zero(eltype(x))
           @fastmath for i = eachindex(x,y)
                acc += x[i]*y[i]
           end
           acc
       end
mydot_fm (generic function with 1 method)

julia> x = rand(256); y = x;

julia> @btime mydot($x,$y)
  53.266 ns (0 allocations: 0 bytes)
87.20858931570895

julia> @btime mydot_fm($x,$y)
  16.665 ns (0 allocations: 0 bytes)
87.20858931570893

On 1.9.3, I get

julia> @btime mydot($x,$y)
  17.875 ns (0 allocations: 0 bytes)
83.2453858603898

julia> @btime mydot_fm($x,$y)
  17.470 ns (0 allocations: 0 bytes)
83.2453858603898
chriselrod commented 1 year ago

It seems to be peeling a single iteration, so powers of 2 are particularly impacted, as you end up spending a lot of time in the scalar iterations:

julia> x = rand(33); y = x;

julia> @btime mydot($x,$y)
  10.161 ns (0 allocations: 0 bytes)
10.41065848689856

julia> @btime mydot_fm($x,$y)
  9.526 ns (0 allocations: 0 bytes)
10.410658486898559

julia> x = rand(32); y = x;

julia> @btime mydot($x,$y)
  24.680 ns (0 allocations: 0 bytes)
11.942189618595611

julia> @btime mydot_fm($x,$y)
  7.401 ns (0 allocations: 0 bytes)
11.942189618595613

I.e., when we have 33 iterations, performance is comparable, as both @fastmath and @simd do 32 iterations in a vector block, and a single scalar iteration. However, if we do only 32 iterations total, while @fastmath does the right thing and vectorizes all of them, @simd finds that 32-1 = 31 is too short for the vector block, and ends up doing all 32 iterations under the scalar code path.
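
The split described above can be modeled with a short sketch (hypothetical helper names; the width of 32 corresponds to the 8-wide, 4-way-unrolled vector body in the IR in the next comment):

```python
# Model of the trip-count split seen above (hypothetical helpers; width 32
# = 8-wide vectors x 4-way unroll, per the IR in the next comment).
def simd_split_buggy(n, width=32):
    """(vector_iterations, scalar_iterations) for the bad @simd path."""
    if n == 0:
        return (0, 0)
    umin = n - 1                  # peeled-by-one bound from the preheader
    if umin < width:              # min.iters.check sends n <= width scalar
        return (0, n)
    r = n % width
    n_vec = n - (width if r == 0 else r)   # buggy select: never 0
    return (n_vec // width, n - n_vec)

def simd_split_expected(n, width=32):
    """What @fastmath does: only n % width scalar iterations remain."""
    if n < width:
        return (0, n)
    n_vec = n - n % width
    return (n_vec // width, n - n_vec)
```

For n = 33 both splits leave a single scalar iteration, but for n = 32 or n = 256 the buggy path runs a full 32 scalar iterations.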

chriselrod commented 1 year ago
L24:                                              ; preds = %top
  %.not36 = icmp eq i64 %arraylen, 0
  br i1 %.not36, label %L50, label %L31.preheader

L31.preheader:                                    ; preds = %L24
  %23 = bitcast {}* %0 to double**
  %arrayptr38 = load double*, double** %23, align 8
  %24 = bitcast {}* %1 to double**
  %arrayptr1439 = load double*, double** %24, align 8
  %25 = add nsw i64 %arraylen, -1
  %umin = call i64 @llvm.umin.i64(i64 %arraylen, i64 %25)
  %min.iters.check = icmp ult i64 %umin, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L31.preheader
  %26 = add nuw nsw i64 %umin, 1
  %n.mod.vf = and i64 %26, 31
  %27 = icmp eq i64 %n.mod.vf, 0
  %28 = select i1 %27, i64 32, i64 %n.mod.vf
  %n.vec = sub nsw i64 %26, %28
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x double> [ <double 0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %49, %vector.body ]
  %vec.phi79 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %50, %vector.body ]
  %vec.phi80 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %51, %vector.body ]
  %vec.phi81 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %52, %vector.body ]
  %29 = getelementptr inbounds double, double* %arrayptr38, i64 %index
  %30 = bitcast double* %29 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %30, align 8
  %31 = getelementptr inbounds double, double* %29, i64 8
  %32 = bitcast double* %31 to <8 x double>*
  %wide.load82 = load <8 x double>, <8 x double>* %32, align 8
  %33 = getelementptr inbounds double, double* %29, i64 16
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load83 = load <8 x double>, <8 x double>* %34, align 8
  %35 = getelementptr inbounds double, double* %29, i64 24
  %36 = bitcast double* %35 to <8 x double>*
  %wide.load84 = load <8 x double>, <8 x double>* %36, align 8
  %37 = getelementptr inbounds double, double* %arrayptr1439, i64 %index
  %38 = bitcast double* %37 to <8 x double>*
  %wide.load85 = load <8 x double>, <8 x double>* %38, align 8
  %39 = getelementptr inbounds double, double* %37, i64 8
  %40 = bitcast double* %39 to <8 x double>*
  %wide.load86 = load <8 x double>, <8 x double>* %40, align 8
  %41 = getelementptr inbounds double, double* %37, i64 16
  %42 = bitcast double* %41 to <8 x double>*
  %wide.load87 = load <8 x double>, <8 x double>* %42, align 8
  %43 = getelementptr inbounds double, double* %37, i64 24
  %44 = bitcast double* %43 to <8 x double>*
  %wide.load88 = load <8 x double>, <8 x double>* %44, align 8
  %45 = fmul contract <8 x double> %wide.load, %wide.load85
  %46 = fmul contract <8 x double> %wide.load82, %wide.load86
  %47 = fmul contract <8 x double> %wide.load83, %wide.load87
  %48 = fmul contract <8 x double> %wide.load84, %wide.load88
  %49 = fadd reassoc contract <8 x double> %vec.phi, %45
  %50 = fadd reassoc contract <8 x double> %vec.phi79, %46
  %51 = fadd reassoc contract <8 x double> %vec.phi80, %47
  %52 = fadd reassoc contract <8 x double> %vec.phi81, %48
  %index.next = add nuw i64 %index, 32
  %53 = icmp eq i64 %index.next, %n.vec
  br i1 %53, label %middle.block, label %vector.body

Note that the preheader has

  %25 = add nsw i64 %arraylen, -1
  %umin = call i64 @llvm.umin.i64(i64 %arraylen, i64 %25)
  %min.iters.check = icmp ult i64 %umin, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

First of all, the umin is annoying, because the -1 has nsw when it should also have nuw: we know arraylen != 0 (see the branch in L24), and hence the -1 cannot cause unsigned wrapping, which would let the umin fold away entirely.
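
A quick numeric check of that claim (plain Python modeling 64-bit unsigned arithmetic; umin64 is a hypothetical stand-in for llvm.umin.i64):

```python
# Since arraylen != 0 on this path, arraylen - 1 never wraps unsigned,
# so umin(arraylen, arraylen - 1) is always just arraylen - 1.
def umin64(a, b):
    return min(a % 2**64, b % 2**64)

for arraylen in (1, 2, 32, 33, 2**64 - 1):
    assert umin64(arraylen, arraylen - 1) == arraylen - 1
```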

But the more important problem is that we compare

if %umin < 32 # if arraylen - 1 < 32, i.e. if arraylen < 33
    scalar
else
    vector
end

while the vector loop itself breaks when

x = umin + 1
r = x % 32
n_vec = x - (r==0 ? 32 : r)
index = 0
while true
    index += 32
    index == n_vec && break
end

yikes! So, whenever we have a multiple of 32, it does a full 32 scalar iterations!!! It should be

x = umin + 1
r = x % 32
n_vec = x - (r==0 ? 0 : r)
index = 0

i.e.

vector.ph:                                        ; preds = %L31.preheader
  %26 = add nuw nsw i64 %umin, 1
  %n.mod.vf = and i64 %26, 31
  %27 = icmp eq i64 %n.mod.vf, 0
  %28 = select i1 %27, i64 0, i64 %n.mod.vf
  %n.vec = sub nsw i64 %26, %28
  br label %vector.body
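
Side by side, the buggy and fixed selects give (a sketch of the vector.ph arithmetic above, not compiler code; vf = 32 is the combined vectorization-times-unroll factor from the IR):

```python
# n.vec computation from vector.ph, with the buggy and the fixed select
# (vf = 32 = vector width x interleave count, per the IR above).
def n_vec(arraylen, fixed, vf=32):
    x = (arraylen - 1) + 1        # umin + 1; arraylen != 0 on this path
    r = x & (vf - 1)              # n.mod.vf
    adjust = (0 if fixed else vf) if r == 0 else r
    return x - adjust

# n_vec(256, fixed=False) -> 224, leaving 32 scalar iterations;
# n_vec(256, fixed=True)  -> 256, leaving none.
```

For trip counts that are not a multiple of vf the two agree; only exact multiples hit the bug.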

Naively, I'd guess this is a regression in LLVM 15.0.7 (used by Julia 1.10-beta3) compared to LLVM 14.0.6 (used by Julia 1.9.3).

giordano commented 1 year ago

https://github.com/JuliaLang/julia/blob/fb01dd28a5fbabdae6cb5f23c8493db9a377fda3/HISTORY.md#L24-L27

chriselrod commented 1 year ago

https://github.com/JuliaLang/julia/blob/fb01dd28a5fbabdae6cb5f23c8493db9a377fda3/HISTORY.md#L24-L27

@giordano, that is unrelated.

It is still using fma instructions and reordering here, it is just doing SIMD_WIDTH*UNROLL_FACTOR scalar iterations whenever we have a number of iterations that is a multiple of this, when it should be doing 0. I.e., when this number is 32 and we have 256 iterations, it should be doing 256/32 vector body iterations and 0 scalar iterations, and not (256 - 32) / 32 vector iterations + 32 scalar iterations.

Zentrik commented 1 year ago

I don't see this problem on 1.10-beta3; the LLVM IR I get is quite different, notably:

L24:                                              ; preds = %top
  %.not36 = icmp eq i64 %arraylen, 0
  br i1 %.not36, label %L50, label %L31.preheader

L31.preheader:                                    ; preds = %L24
  %22 = bitcast {}* %0 to double**
  %arrayptr38 = load double*, double** %22, align 8
  %23 = bitcast {}* %1 to double**
  %arrayptr1439 = load double*, double** %23, align 8
  %min.iters.check = icmp ult i64 %arraylen, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L31.preheader
  %n.vec = and i64 %arraylen, 9223372036854775792
  br label %vector.body
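
For reference, the magic constant in this good vector.ph is just a mask that rounds arraylen down to a multiple of 16, so no full vector block is ever dropped (a quick check, assuming 64-bit two's-complement values):

```python
# 9223372036854775792 == 0x7FFFFFFFFFFFFFF0, i.e. all bits set except the
# low four and the sign bit; and-ing rounds down to a multiple of 16.
MASK = 9223372036854775792
assert MASK == (2**63 - 1) & ~15
for n in (16, 17, 31, 32, 256):
    assert n & MASK == n - n % 16
```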

Full IR:

julia> @code_llvm debuginfo=:none mydot(x, y)
; Function Attrs: uwtable
define double @julia_mydot_847({}* noundef nonnull align 16 dereferenceable(40) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
  %2 = alloca [3 x {}*], align 8
  %gcframe81 = alloca [4 x {}*], align 16
  %gcframe81.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 0
  %3 = bitcast [4 x {}*]* %gcframe81 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %3, i8 0, i64 32, i1 true)
  %4 = call {}*** inttoptr (i64 140735745018800 to {}*** ()*)() #13
  %5 = bitcast [4 x {}*]* %gcframe81 to i64*
  store i64 8, i64* %5, align 16
  %6 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 1
  %7 = bitcast {}** %6 to {}***
  %8 = load {}**, {}*** %4, align 8
  store {}** %8, {}*** %7, align 8
  %9 = bitcast {}*** %4 to {}***
  store {}** %gcframe81.sub, {}*** %9, align 8
  %10 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
  %arraylen_ptr = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %10, i64 0, i32 1
  %arraylen = load i64, i64* %arraylen_ptr, align 8
  %11 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
  %arraylen_ptr1 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %11, i64 0, i32 1
  %arraylen2 = load i64, i64* %arraylen_ptr1, align 8
  %.not = icmp eq i64 %arraylen, %arraylen2
  br i1 %.not, label %L24, label %L12

L12:                                              ; preds = %top
  %.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 0
  %ptls_field82 = getelementptr inbounds {}**, {}*** %4, i64 2
  %12 = bitcast {}*** %ptls_field82 to i8**
  %ptls_load8384 = load i8*, i8** %12, align 8
  %box = call noalias nonnull dereferenceable(16) {}* @ijl_gc_pool_alloc(i8* %ptls_load8384, i32 1136, i32 16) #11
  %13 = bitcast {}* %box to i64*
  %14 = getelementptr inbounds i64, i64* %13, i64 -1
  store atomic i64 140735255174496, i64* %14 unordered, align 8
  store i64 %arraylen, i64* %13, align 8
  %15 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 3
  store {}* %box, {}** %15, align 8
  %ptls_load798586 = load i8*, i8** %12, align 8
  %box28 = call noalias nonnull dereferenceable(16) {}* @ijl_gc_pool_alloc(i8* %ptls_load798586, i32 1136, i32 16) #11
  %16 = bitcast {}* %box28 to i64*
  %17 = getelementptr inbounds i64, i64* %16, i64 -1
  store atomic i64 140735255174496, i64* %17 unordered, align 8
  store i64 %arraylen2, i64* %16, align 8
  %18 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 2
  store {}* %box28, {}** %18, align 16
  store {}* inttoptr (i64 140735203794032 to {}*), {}** %.sub, align 8
  %19 = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 1
  store {}* %box, {}** %19, align 8
  %20 = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 2
  store {}* %box28, {}** %20, align 8
  %21 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 140735273431664 to {}*), {}** nonnull %.sub, i32 3, {}* inttoptr (i64 2767574489760 to {}*))
  call void @llvm.trap()
  unreachable

L24:                                              ; preds = %top
  %.not36 = icmp eq i64 %arraylen, 0
  br i1 %.not36, label %L50, label %L31.preheader

L31.preheader:                                    ; preds = %L24
  %22 = bitcast {}* %0 to double**
  %arrayptr38 = load double*, double** %22, align 8
  %23 = bitcast {}* %1 to double**
  %arrayptr1439 = load double*, double** %23, align 8
  %min.iters.check = icmp ult i64 %arraylen, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L31.preheader
  %n.vec = and i64 %arraylen, 9223372036854775792
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ <double 0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %44, %vector.body ]
  %vec.phi61 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %45, %vector.body ]
  %vec.phi62 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %46, %vector.body ]
  %vec.phi63 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %47, %vector.body ]
  %24 = getelementptr inbounds double, double* %arrayptr38, i64 %index
  %25 = bitcast double* %24 to <4 x double>*
  %wide.load = load <4 x double>, <4 x double>* %25, align 8
  %26 = getelementptr inbounds double, double* %24, i64 4
  %27 = bitcast double* %26 to <4 x double>*
  %wide.load64 = load <4 x double>, <4 x double>* %27, align 8
  %28 = getelementptr inbounds double, double* %24, i64 8
  %29 = bitcast double* %28 to <4 x double>*
  %wide.load65 = load <4 x double>, <4 x double>* %29, align 8
  %30 = getelementptr inbounds double, double* %24, i64 12
  %31 = bitcast double* %30 to <4 x double>*
  %wide.load66 = load <4 x double>, <4 x double>* %31, align 8
  %32 = getelementptr inbounds double, double* %arrayptr1439, i64 %index
  %33 = bitcast double* %32 to <4 x double>*
  %wide.load67 = load <4 x double>, <4 x double>* %33, align 8
  %34 = getelementptr inbounds double, double* %32, i64 4
  %35 = bitcast double* %34 to <4 x double>*
  %wide.load68 = load <4 x double>, <4 x double>* %35, align 8
  %36 = getelementptr inbounds double, double* %32, i64 8
  %37 = bitcast double* %36 to <4 x double>*
  %wide.load69 = load <4 x double>, <4 x double>* %37, align 8
  %38 = getelementptr inbounds double, double* %32, i64 12
  %39 = bitcast double* %38 to <4 x double>*
  %wide.load70 = load <4 x double>, <4 x double>* %39, align 8
  %40 = fmul contract <4 x double> %wide.load, %wide.load67
  %41 = fmul contract <4 x double> %wide.load64, %wide.load68
  %42 = fmul contract <4 x double> %wide.load65, %wide.load69
  %43 = fmul contract <4 x double> %wide.load66, %wide.load70
  %44 = fadd reassoc contract <4 x double> %vec.phi, %40
  %45 = fadd reassoc contract <4 x double> %vec.phi61, %41
  %46 = fadd reassoc contract <4 x double> %vec.phi62, %42
  %47 = fadd reassoc contract <4 x double> %vec.phi63, %43
  %index.next = add nuw i64 %index, 16
  %48 = icmp eq i64 %index.next, %n.vec
  br i1 %48, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd reassoc contract <4 x double> %45, %44
  %bin.rdx71 = fadd reassoc contract <4 x double> %46, %bin.rdx
  %bin.rdx72 = fadd reassoc contract <4 x double> %47, %bin.rdx71
  %49 = call reassoc contract double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx72)
  %cmp.n = icmp eq i64 %arraylen, %n.vec
  br i1 %cmp.n, label %L50, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L31.preheader
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L31.preheader ]
  %bc.merge.rdx = phi double [ %49, %middle.block ], [ 0.000000e+00, %L31.preheader ]
  br label %idxend12

L50:                                              ; preds = %idxend12, %middle.block, %L24
  %value_phi16 = phi double [ 0.000000e+00, %L24 ], [ %49, %middle.block ], [ %56, %idxend12 ]
  %50 = load {}*, {}** %6, align 8
  %51 = bitcast {}*** %4 to {}**
  store {}* %50, {}** %51, align 8
  ret double %value_phi16

idxend12:                                         ; preds = %idxend12, %scalar.ph
  %value_phi447 = phi i64 [ %52, %idxend12 ], [ %bc.resume.val, %scalar.ph ]
  %value_phi46 = phi double [ %56, %idxend12 ], [ %bc.merge.rdx, %scalar.ph ]
  %52 = add nuw nsw i64 %value_phi447, 1
  %53 = getelementptr inbounds double, double* %arrayptr38, i64 %value_phi447
  %arrayref = load double, double* %53, align 8
  %54 = getelementptr inbounds double, double* %arrayptr1439, i64 %value_phi447
  %arrayref15 = load double, double* %54, align 8
  %55 = fmul contract double %arrayref, %arrayref15
  %56 = fadd reassoc contract double %value_phi46, %55
  %exitcond.not = icmp eq i64 %52, %arraylen
  br i1 %exitcond.not, label %L50, label %idxend12
}

I have AVX with 256-bit vectors (no AVX-512).

Julia Version 1.10.0-beta3
Commit 404750f858 (2023-10-03 12:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 5700U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 1 on 16 virtual cores
oscardssmith commented 1 year ago

I think this is likely an avx-512 specific regression.

chriselrod commented 1 year ago

I think this is likely an avx-512 specific regression.

@brenhinkeller reported not having this regression on his M1.

I'm using -Cnative,-prefer-256-bit and didn't confirm whether I get the regression under default settings.

brenhinkeller commented 1 year ago

Yeah, I get equivalent performance from both functions on M1, in both 1.9.3 and 1.10.0-beta3, though possibly just because neither seems to be vectorizing well there AFAICT?

chriselrod commented 1 year ago

I can only reproduce this when using --check-bounds=yes, which I use more often than not.

Maybe it is limited to AVX-512 CPUs, but it is definitely not limited to 512-bit vectors. Using 256-bit vectors:

julia> @btime mydot($x,$y)
  40.730 ns (0 allocations: 0 bytes)
88.99071159759116

julia> @btime mydot_fm($x,$y)
  24.554 ns (0 allocations: 0 bytes)
88.99071159759114

vs 512-bit:

julia> @btime mydot($x,$y)
  53.122 ns (0 allocations: 0 bytes)
87.47595130161963

julia> @btime mydot_fm($x,$y)
  17.608 ns (0 allocations: 0 bytes)
87.47595130161963
Zentrik commented 1 year ago

I can reproduce with --check-bounds=yes, the LLVM IR looks identical to what you posted earlier.