Quuxplusone / LLVMBugzillaTest


unnecessary widening lowers vector performance #45857

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR46888
Status NEW
Importance P enhancement
Reported by Joel Hutton (joel.hutton@arm.com)
Reported on 2020-07-29 03:31:35 -0700
Last modified on 2021-08-04 14:05:45 -0700
Version trunk
Hardware Other Linux
CC david.bolvansky@gmail.com, david.green@arm.com, efriedma@quicinc.com, florian_hahn@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks PR46929
Blocked by
See also
for the following test snippet:

#include <math.h>
#include <stdlib.h> /* abs() on int is declared here */

int arrSum(unsigned char *a1, int inc_a1, unsigned char *a2,
           int inc_a2) {
  int sum = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++) {
      sum += abs(a1[x] - a2[x]);
    }
    a1 += inc_a1;
    a2 += inc_a2;
  }
  return sum;
}

using clang -O3:

LLVM trunk widens the bytes to 16-bit halfwords with ushll instructions and
then takes the absolute differences of the halfwords.

    sxtw    x8, w3
    ldr     q4, [x0]
    ldr     q5, [x2]
    add     x10, x0, x9
    add     x11, x2, x8
    ldr     q0, [x10]
    ldr     q1, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ldr     q2, [x10]
    ldr     q3, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ushll   v7.8h, v4.8b, #0
    ushll2  v16.8h, v4.16b, #0
    ushll   v17.8h, v5.8b, #0
    ushll2  v5.8h, v5.16b, #0
    ldr     q4, [x10]
    ldr     q6, [x11]
    uabdl   v18.4s, v16.4h, v5.4h
    uabdl   v19.4s, v7.4h, v17.4h
    uabdl2  v16.4s, v16.8h, v5.8h
    uabdl2  v17.4s, v7.8h, v17.8h

This is wider than necessary; the absolute-difference operation can be
performed on the bytes directly. Performing the operation directly on bytes
processes 8 lanes at a time instead of 4, as well as avoiding the unnecessary
shifts. This can be seen in the GCC codegen.

For gcc (trunk) -O3:

    sxtw    x2, w3
    sxtw    x3, w1
    add     x11, x5, x2
    ldr     q1, [x4]
    add     x9, x11, x2
    ldr     q5, [x5]
    add     x7, x9, x2
    ldr     q3, [x4, w1, sxtw]
    add     x4, x4, x3
    uabdl2  v2.8h, v1.16b, v5.16b
    add     x10, x4, x3
    ldr     q4, [x5, w2, sxtw]
    add     x8, x10, x3
    movi    v0.4s, 0
    add     x6, x8, x3
    uabal   v2.8h, v1.8b, v5.8b
    add     x5, x7, x2
    uabdl2  v1.8h, v3.16b, v4.16b
    add     x19, x5, x2
    ldr     q6, [x11, w2, sxtw]
    mov     x0, x2
    ldr     q5, [x4, w1, sxtw]
    add     x4, x6, x3
    uabal   v1.8h, v3.8b, v4.8b
    add     x30, x4, x3
    uadalp  v0.4s, v2.8h
    add     x18, x19, x2
    uabdl2  v2.8h, v5.16b, v6.16b

This affects SPEC2017 x264 performance.
Quuxplusone commented 4 years ago
From clang, we get the values in the inner loop extended to i32 (note that
this example is without the abs() call).

  %20 = load i8*, i8** %5, align 8
  %21 = load i32, i32* %11, align 4
  %22 = sext i32 %21 to i64
  %23 = getelementptr inbounds i8, i8* %20, i64 %22
  %24 = load i8, i8* %23, align 1
  %25 = zext i8 %24 to i32
  %26 = load i8*, i8** %7, align 8
  %27 = load i32, i32* %11, align 4
  %28 = sext i32 %27 to i64
  %29 = getelementptr inbounds i8, i8* %26, i64 %28
  %30 = load i8, i8* %29, align 1
  %31 = zext i8 %30 to i32
  %32 = sub nsw i32 %25, %31
  %33 = load i32, i32* %9, align 4
  %34 = add nsw i32 %33, %32
  store i32 %34, i32* %9, align 4
  br label %35

I think we could narrow the width of the extends in SLPVectorizer and extend
the result: extend the operands to i16, compute on i16, extend the result to
i32, then store. It seems like we have to drop the nsw flags though.

https://alive2.llvm.org/ce/z/bFjAHs
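
As an aside (my own illustration, not from the report), the narrowing
described above corresponds to scalar C along these lines, with the
intermediate difference kept in a 16-bit variable and only the final result
widened to 32 bits; the function name and signature are hypothetical:

    /* Sketch of the proposed narrowing (no abs(), as in the IR above): the
     * difference of two zero-extended bytes always fits in 16 bits, so it is
     * kept in a short and only sign-extended to int for the running sum. */
    int arrSumNarrowed(const unsigned char *a1, const unsigned char *a2,
                       int n, int sum) {
      for (int x = 0; x < n; x++) {
        short d = (short)((short)a1[x] - (short)a2[x]); /* range [-255, 255] */
        sum += (int)d; /* widen only the result to 32 bits */
      }
      return sum;
    }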
Quuxplusone commented 4 years ago

I think the entire sum in the original testcase fits into 16-bit arithmetic, since it's the sum of 256 8-bit numbers. Granted, it's a little tricky to detect that.
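
For concreteness (my own sketch, not part of the report): each absolute
difference is at most 255 and there are 16 * 16 = 256 of them, so the total is
at most 256 * 255 = 65280 < 65536, i.e. an unsigned 16-bit accumulator can
never overflow. A hand-narrowed variant, for illustration only:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical hand-narrowed version of the testcase: the whole reduction
     * is kept in a 16-bit accumulator, which cannot overflow because the
     * maximum possible sum is 256 * 255 = 65280. */
    int arrSum16(const unsigned char *a1, int inc_a1,
                 const unsigned char *a2, int inc_a2) {
      uint16_t sum = 0;
      for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
          sum += (uint16_t)abs(a1[x] - a2[x]);
        a1 += inc_a1;
        a2 += inc_a2;
      }
      return sum; /* widen the 16-bit total back to int only at the end */
    }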

Quuxplusone commented 3 years ago
https://godbolt.org/z/3e9hY5GEa

On trunk we're now recognising the reductions (aarch64 is unrolling the outer
loop; x86_64 is not, but the inner loop is where the fun is):

  %12 = bitcast i8* %10 to <16 x i8>*
  %13 = load <16 x i8>, <16 x i8>* %12, align 1
  %14 = zext <16 x i8> %13 to <16 x i32>
  %15 = bitcast i8* %11 to <16 x i8>*
  %16 = load <16 x i8>, <16 x i8>* %15, align 1
  %17 = zext <16 x i8> %16 to <16 x i32>
  %18 = sub nsw <16 x i32> %14, %17
  %19 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %18, i1 true)
  %20 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %19)

I think that SAD pattern is clear enough now that we could get value tracking
to recognise the upper bits of the reduction result aren't critical, and reduce
the width we extend to.

So I guess the next step is to add llvm.vector.reduce.add support to the
computeKnownBits/computeNumSignBits implementations.
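
As a rough sketch of the bound such a computeKnownBits extension could use (my
assumption, not an actual patch): if every lane feeding the
llvm.vector.reduce.add has at least Z known leading zero bits, summing N lanes
can carry into at most ceil(log2(N)) extra bits, so Z - ceil(log2(N)) leading
zeros remain known in the result. In C terms, with hypothetical names:

    #include <assert.h>

    /* If each lane of a vector.reduce.add input has at least elem_lz known
     * leading zero bits, the sum of num_elems lanes keeps at least
     * elem_lz - ceil(log2(num_elems)) of them. */
    unsigned reduce_add_known_leading_zeros(unsigned elem_lz,
                                            unsigned num_elems) {
      unsigned extra = 0;
      while ((1u << extra) < num_elems) /* extra = ceil(log2(num_elems)) */
        extra++;
      return elem_lz > extra ? elem_lz - extra : 0;
    }

    int main(void) {
      /* In the testcase each lane is the abs-difference of zext'd u8 values,
       * so each i32 lane is < 256 (24 known leading zeros); 16 lanes leave at
       * least 20 known zeros, i.e. the reduction fits in 12 bits and a
       * 16-bit extend would be wide enough. */
      assert(reduce_add_known_leading_zeros(24, 16) == 20);
      return 0;
    }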