chriselrod opened 1 year ago:
It seems @simd is peeling a single iteration, so powers of 2 are particularly impacted in performance, as you end up spending a lot of time in the scalar iterations (a sketch of the two benchmarked kernels follows the timings):
julia> x = rand(33); y = x;
julia> @btime mydot($x,$y)
10.161 ns (0 allocations: 0 bytes)
10.41065848689856
julia> @btime mydot_fm($x,$y)
9.526 ns (0 allocations: 0 bytes)
10.410658486898559
julia> x = rand(32); y = x;
julia> @btime mydot($x,$y)
24.680 ns (0 allocations: 0 bytes)
11.942189618595611
julia> @btime mydot_fm($x,$y)
7.401 ns (0 allocations: 0 bytes)
11.942189618595613
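For reference, the definitions of mydot and mydot_fm are not included above; presumably they look something like the following minimal sketch (an assumption, not the author's exact code), with mydot using @inbounds @simd and mydot_fm using @fastmath:

# Hypothetical sketch of the two benchmarked kernels.
function mydot(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

function mydot_fm(x, y)
    s = zero(eltype(x))
    @fastmath @inbounds for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end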
I.e., when we have 33 iterations, performance is comparable, as both @fastmath and @simd do 32 iterations in a vector block plus a single scalar iteration. However, if we do only 32 iterations total, @fastmath does the right thing and vectorizes all of them, while @simd finds that 32 - 1 = 31 is too short for the vector block and ends up doing all 32 iterations under the scalar code path. The relevant portion of the @code_llvm output for mydot:
L24: ; preds = %top
%.not36 = icmp eq i64 %arraylen, 0
br i1 %.not36, label %L50, label %L31.preheader
L31.preheader: ; preds = %L24
%23 = bitcast {}* %0 to double**
%arrayptr38 = load double*, double** %23, align 8
%24 = bitcast {}* %1 to double**
%arrayptr1439 = load double*, double** %24, align 8
%25 = add nsw i64 %arraylen, -1
%umin = call i64 @llvm.umin.i64(i64 %arraylen, i64 %25)
%min.iters.check = icmp ult i64 %umin, 32
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %L31.preheader
%26 = add nuw nsw i64 %umin, 1
%n.mod.vf = and i64 %26, 31
%27 = icmp eq i64 %n.mod.vf, 0
%28 = select i1 %27, i64 32, i64 %n.mod.vf
%n.vec = sub nsw i64 %26, %28
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <8 x double> [ <double 0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %49, %vector.body ]
%vec.phi79 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %50, %vector.body ]
%vec.phi80 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %51, %vector.body ]
%vec.phi81 = phi <8 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %52, %vector.body ]
%29 = getelementptr inbounds double, double* %arrayptr38, i64 %index
%30 = bitcast double* %29 to <8 x double>*
%wide.load = load <8 x double>, <8 x double>* %30, align 8
%31 = getelementptr inbounds double, double* %29, i64 8
%32 = bitcast double* %31 to <8 x double>*
%wide.load82 = load <8 x double>, <8 x double>* %32, align 8
%33 = getelementptr inbounds double, double* %29, i64 16
%34 = bitcast double* %33 to <8 x double>*
%wide.load83 = load <8 x double>, <8 x double>* %34, align 8
%35 = getelementptr inbounds double, double* %29, i64 24
%36 = bitcast double* %35 to <8 x double>*
%wide.load84 = load <8 x double>, <8 x double>* %36, align 8
%37 = getelementptr inbounds double, double* %arrayptr1439, i64 %index
%38 = bitcast double* %37 to <8 x double>*
%wide.load85 = load <8 x double>, <8 x double>* %38, align 8
%39 = getelementptr inbounds double, double* %37, i64 8
%40 = bitcast double* %39 to <8 x double>*
%wide.load86 = load <8 x double>, <8 x double>* %40, align 8
%41 = getelementptr inbounds double, double* %37, i64 16
%42 = bitcast double* %41 to <8 x double>*
%wide.load87 = load <8 x double>, <8 x double>* %42, align 8
%43 = getelementptr inbounds double, double* %37, i64 24
%44 = bitcast double* %43 to <8 x double>*
%wide.load88 = load <8 x double>, <8 x double>* %44, align 8
%45 = fmul contract <8 x double> %wide.load, %wide.load85
%46 = fmul contract <8 x double> %wide.load82, %wide.load86
%47 = fmul contract <8 x double> %wide.load83, %wide.load87
%48 = fmul contract <8 x double> %wide.load84, %wide.load88
%49 = fadd reassoc contract <8 x double> %vec.phi, %45
%50 = fadd reassoc contract <8 x double> %vec.phi79, %46
%51 = fadd reassoc contract <8 x double> %vec.phi80, %47
%52 = fadd reassoc contract <8 x double> %vec.phi81, %48
%index.next = add nuw i64 %index, 32
%53 = icmp eq i64 %index.next, %n.vec
br i1 %53, label %middle.block, label %vector.body
Note that the preheader has
%25 = add nsw i64 %arraylen, -1
%umin = call i64 @llvm.umin.i64(i64 %arraylen, i64 %25)
%min.iters.check = icmp ult i64 %umin, 32
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
First of all, the umin is annoying, because (a) the -1 has nsw when it should also have had nuw: we know arraylen != 0 (see the branch in L24), and hence the -1 isn't going to cause unsigned wrapping.
But the more important problem is that we compare
if %umin < 32 # if arraylen - 1 < 32, i.e. if arraylen < 33
scalar
else
vector
end
while the vector loop itself breaks when
x = umin + 1
r = x % 32
n_vec = x - (r==0 ? 32 : r)
index = 0
while true
    index += 32
    index == n_vec && break
end
Yikes! So, whenever we have a multiple of 32 iterations, it does a full 32 scalar iterations!!!
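To make the trip-count arithmetic concrete, here is a small Julia sketch (not the compiler's code, just a transcription of the IR's logic; the function name is made up) evaluated at a few lengths:

# Mirrors %umin, min.iters.check, %n.mod.vf, and %n.vec from the IR above.
function nvec_current(arraylen)
    umin = min(arraylen, arraylen - 1)   # %umin (arraylen - 1 for any arraylen >= 1)
    umin < 32 && return 0                # min.iters.check: everything runs scalar
    x = umin + 1                         # %26
    r = x % 32                           # %n.mod.vf
    return x - (r == 0 ? 32 : r)         # %n.vec
end

nvec_current(33)  # 32: 32 vectorized iterations + 1 scalar iteration
nvec_current(32)  # 0:  all 32 iterations take the scalar path
nvec_current(64)  # 32: only half of the 64 iterations are vectorized; 32 run scalar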
It should be
x = umin + 1
r = x % 32
n_vec = x - (r==0 ? 0 : r)
index = 0
i.e.
vector.ph: ; preds = %L31.preheader
%26 = add nuw nsw i64 %umin, 1
%n.mod.vf = and i64 %26, 31
%27 = icmp eq i64 %n.mod.vf, 0
%28 = select i1 %27, i64 0, i64 %n.mod.vf
%n.vec = sub nsw i64 %26, %28
br label %vector.body
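With the same sketch as above, swapping the select's 32 for 0 gives the expected trip counts (the min.iters.check problem for lengths below 33 is separate and unchanged):

# Same as nvec_current, but a remainder of 0 stays 0.
function nvec_fixed(arraylen)
    umin = min(arraylen, arraylen - 1)
    umin < 32 && return 0
    x = umin + 1
    r = x % 32
    return x - (r == 0 ? 0 : r)
end

nvec_fixed(64)  # 64: every iteration is handled by the vector body
nvec_fixed(33)  # 32: unchanged, 32 vectorized + 1 scalar iteration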
Naively, I'd guess this is a regression in LLVM 15.0.7 (used by Julia 1.10-beta3) compared to LLVM 14.0.6 (used by Julia 1.9.3).
https://github.com/JuliaLang/julia/blob/fb01dd28a5fbabdae6cb5f23c8493db9a377fda3/HISTORY.md#L24-L27
@giordano, that is unrelated. It is still using fma instructions and reordering here; it is just doing SIMD_WIDTH*UNROLL_FACTOR scalar iterations whenever the total number of iterations is a multiple of that, when it should be doing 0. I.e., when this number is 32 and we have 256 iterations, it should be doing 256/32 = 8 vector body iterations and 0 scalar iterations, not (256 - 32)/32 = 7 vector iterations plus 32 scalar iterations.
I don't see this problem on 1.10-beta3; the LLVM IR I get seems to be quite different, notably:
L24: ; preds = %top
%.not36 = icmp eq i64 %arraylen, 0
br i1 %.not36, label %L50, label %L31.preheader
L31.preheader: ; preds = %L24
%22 = bitcast {}* %0 to double**
%arrayptr38 = load double*, double** %22, align 8
%23 = bitcast {}* %1 to double**
%arrayptr1439 = load double*, double** %23, align 8
%min.iters.check = icmp ult i64 %arraylen, 16
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %L31.preheader
%n.vec = and i64 %arraylen, 9223372036854775792
br label %vector.body
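Here the mask 9223372036854775792 is 0x7ffffffffffffff0, so %n.vec is just arraylen rounded down to a multiple of 16 (clearing the sign bit is a no-op for a valid length). A quick sketch of that logic (hypothetical helper name):

nvec_good(arraylen) = Int(arraylen & 0x7ffffffffffffff0)  # mirrors %n.vec above

nvec_good(32)  # 32: a length of 32 is fully vectorized, no scalar tail
nvec_good(33)  # 32: 32 vectorized iterations + 1 scalar iteration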
Full IR here
julia> @code_llvm debuginfo=:none mydot(x, y)
; Function Attrs: uwtable
define double @julia_mydot_847({}* noundef nonnull align 16 dereferenceable(40) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
%2 = alloca [3 x {}*], align 8
%gcframe81 = alloca [4 x {}*], align 16
%gcframe81.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 0
%3 = bitcast [4 x {}*]* %gcframe81 to i8*
call void @llvm.memset.p0i8.i64(i8* align 16 %3, i8 0, i64 32, i1 true)
%4 = call {}*** inttoptr (i64 140735745018800 to {}*** ()*)() #13
%5 = bitcast [4 x {}*]* %gcframe81 to i64*
store i64 8, i64* %5, align 16
%6 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 1
%7 = bitcast {}** %6 to {}***
%8 = load {}**, {}*** %4, align 8
store {}** %8, {}*** %7, align 8
%9 = bitcast {}*** %4 to {}***
store {}** %gcframe81.sub, {}*** %9, align 8
%10 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%arraylen_ptr = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %10, i64 0, i32 1
%arraylen = load i64, i64* %arraylen_ptr, align 8
%11 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
%arraylen_ptr1 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %11, i64 0, i32 1
%arraylen2 = load i64, i64* %arraylen_ptr1, align 8
%.not = icmp eq i64 %arraylen, %arraylen2
br i1 %.not, label %L24, label %L12
L12: ; preds = %top
%.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 0
%ptls_field82 = getelementptr inbounds {}**, {}*** %4, i64 2
%12 = bitcast {}*** %ptls_field82 to i8**
%ptls_load8384 = load i8*, i8** %12, align 8
%box = call noalias nonnull dereferenceable(16) {}* @ijl_gc_pool_alloc(i8* %ptls_load8384, i32 1136, i32 16) #11
%13 = bitcast {}* %box to i64*
%14 = getelementptr inbounds i64, i64* %13, i64 -1
store atomic i64 140735255174496, i64* %14 unordered, align 8
store i64 %arraylen, i64* %13, align 8
%15 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 3
store {}* %box, {}** %15, align 8
%ptls_load798586 = load i8*, i8** %12, align 8
%box28 = call noalias nonnull dereferenceable(16) {}* @ijl_gc_pool_alloc(i8* %ptls_load798586, i32 1136, i32 16) #11
%16 = bitcast {}* %box28 to i64*
%17 = getelementptr inbounds i64, i64* %16, i64 -1
store atomic i64 140735255174496, i64* %17 unordered, align 8
store i64 %arraylen2, i64* %16, align 8
%18 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe81, i64 0, i64 2
store {}* %box28, {}** %18, align 16
store {}* inttoptr (i64 140735203794032 to {}*), {}** %.sub, align 8
%19 = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 1
store {}* %box, {}** %19, align 8
%20 = getelementptr inbounds [3 x {}*], [3 x {}*]* %2, i64 0, i64 2
store {}* %box28, {}** %20, align 8
%21 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 140735273431664 to {}*), {}** nonnull %.sub, i32 3, {}* inttoptr (i64 2767574489760 to {}*))
call void @llvm.trap()
unreachable
L24: ; preds = %top
%.not36 = icmp eq i64 %arraylen, 0
br i1 %.not36, label %L50, label %L31.preheader
L31.preheader: ; preds = %L24
%22 = bitcast {}* %0 to double**
%arrayptr38 = load double*, double** %22, align 8
%23 = bitcast {}* %1 to double**
%arrayptr1439 = load double*, double** %23, align 8
%min.iters.check = icmp ult i64 %arraylen, 16
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %L31.preheader
%n.vec = and i64 %arraylen, 9223372036854775792
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x double> [ <double 0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %44, %vector.body ]
%vec.phi61 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %45, %vector.body ]
%vec.phi62 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %46, %vector.body ]
%vec.phi63 = phi <4 x double> [ <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %vector.ph ], [ %47, %vector.body ]
%24 = getelementptr inbounds double, double* %arrayptr38, i64 %index
%25 = bitcast double* %24 to <4 x double>*
%wide.load = load <4 x double>, <4 x double>* %25, align 8
%26 = getelementptr inbounds double, double* %24, i64 4
%27 = bitcast double* %26 to <4 x double>*
%wide.load64 = load <4 x double>, <4 x double>* %27, align 8
%28 = getelementptr inbounds double, double* %24, i64 8
%29 = bitcast double* %28 to <4 x double>*
%wide.load65 = load <4 x double>, <4 x double>* %29, align 8
%30 = getelementptr inbounds double, double* %24, i64 12
%31 = bitcast double* %30 to <4 x double>*
%wide.load66 = load <4 x double>, <4 x double>* %31, align 8
%32 = getelementptr inbounds double, double* %arrayptr1439, i64 %index
%33 = bitcast double* %32 to <4 x double>*
%wide.load67 = load <4 x double>, <4 x double>* %33, align 8
%34 = getelementptr inbounds double, double* %32, i64 4
%35 = bitcast double* %34 to <4 x double>*
%wide.load68 = load <4 x double>, <4 x double>* %35, align 8
%36 = getelementptr inbounds double, double* %32, i64 8
%37 = bitcast double* %36 to <4 x double>*
%wide.load69 = load <4 x double>, <4 x double>* %37, align 8
%38 = getelementptr inbounds double, double* %32, i64 12
%39 = bitcast double* %38 to <4 x double>*
%wide.load70 = load <4 x double>, <4 x double>* %39, align 8
%40 = fmul contract <4 x double> %wide.load, %wide.load67
%41 = fmul contract <4 x double> %wide.load64, %wide.load68
%42 = fmul contract <4 x double> %wide.load65, %wide.load69
%43 = fmul contract <4 x double> %wide.load66, %wide.load70
%44 = fadd reassoc contract <4 x double> %vec.phi, %40
%45 = fadd reassoc contract <4 x double> %vec.phi61, %41
%46 = fadd reassoc contract <4 x double> %vec.phi62, %42
%47 = fadd reassoc contract <4 x double> %vec.phi63, %43
%index.next = add nuw i64 %index, 16
%48 = icmp eq i64 %index.next, %n.vec
br i1 %48, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
%bin.rdx = fadd reassoc contract <4 x double> %45, %44
%bin.rdx71 = fadd reassoc contract <4 x double> %46, %bin.rdx
%bin.rdx72 = fadd reassoc contract <4 x double> %47, %bin.rdx71
%49 = call reassoc contract double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> %bin.rdx72)
%cmp.n = icmp eq i64 %arraylen, %n.vec
br i1 %cmp.n, label %L50, label %scalar.ph
scalar.ph: ; preds = %middle.block, %L31.preheader
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L31.preheader ]
%bc.merge.rdx = phi double [ %49, %middle.block ], [ 0.000000e+00, %L31.preheader ]
br label %idxend12
L50: ; preds = %idxend12, %middle.block, %L24
%value_phi16 = phi double [ 0.000000e+00, %L24 ], [ %49, %middle.block ], [ %56, %idxend12 ]
%50 = load {}*, {}** %6, align 8
%51 = bitcast {}*** %4 to {}**
store {}* %50, {}** %51, align 8
ret double %value_phi16
idxend12: ; preds = %idxend12, %scalar.ph
%value_phi447 = phi i64 [ %52, %idxend12 ], [ %bc.resume.val, %scalar.ph ]
%value_phi46 = phi double [ %56, %idxend12 ], [ %bc.merge.rdx, %scalar.ph ]
%52 = add nuw nsw i64 %value_phi447, 1
%53 = getelementptr inbounds double, double* %arrayptr38, i64 %value_phi447
%arrayref = load double, double* %53, align 8
%54 = getelementptr inbounds double, double* %arrayptr1439, i64 %value_phi447
%arrayref15 = load double, double* %54, align 8
%55 = fmul contract double %arrayref, %arrayref15
%56 = fadd reassoc contract double %value_phi46, %55
%exitcond.not = icmp eq i64 %52, %arraylen
br i1 %exitcond.not, label %L50, label %idxend12
}
I have 256-bit AVX (no AVX-512):
Julia Version 1.10.0-beta3
Commit 404750f858 (2023-10-03 12:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × AMD Ryzen 7 5700U with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 on 16 virtual cores
I think this is likely an avx-512 specific regression.
@brenhinkeller reported not having this regression on his M1.
I'm using -Cnative,-prefer-256-bit and didn't confirm whether I get the regression under default settings.
Yeah, I get equivalent performance between both functions on M1 both in 1.9.3 and 1.10.0-beta3 – though possibly just because neither seems to be vectorizing well there AFAICT?
I can only reproduce this when using --check-bounds=yes, which I use more often than not. Maybe it is limited to AVX-512 hardware, but it definitely is not limited to 512-bit vectors. Using 256-bit vectors:
julia> @btime mydot($x,$y)
40.730 ns (0 allocations: 0 bytes)
88.99071159759116
julia> @btime mydot_fm($x,$y)
24.554 ns (0 allocations: 0 bytes)
88.99071159759114
vs 512-bit vectors:
julia> @btime mydot($x,$y)
53.122 ns (0 allocations: 0 bytes)
87.47595130161963
julia> @btime mydot_fm($x,$y)
17.608 ns (0 allocations: 0 bytes)
87.47595130161963
I can reproduce with --check-bounds=yes; the LLVM IR looks identical to what you posted earlier. Julia 1.10-beta3:
On 1.9.3, I get