Quuxplusone / LLVMBugzillaTest


unnecessary widening lowers vector performance #45857

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR46888
Status NEW
Importance P enhancement
Reported by Joel Hutton (joel.hutton@arm.com)
Reported on 2020-07-29 03:31:35 -0700
Last modified on 2021-08-04 14:05:45 -0700
Version trunk
Hardware Other Linux
CC david.bolvansky@gmail.com, david.green@arm.com, efriedma@quicinc.com, florian_hahn@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks PR46929
Blocked by
See also
for the following test snippet:

#include <math.h>
#include <stdlib.h> /* abs() on int is declared here */

int arrSum(unsigned char *a1, int inc_a1, unsigned char *a2,
           int inc_a2) {
  int sum = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++) {
      sum += abs(a1[x] - a2[x]);
    }
    a1 += inc_a1;
    a2 += inc_a2;
  }
  return sum;
}

using clang -O3:

LLVM trunk widens the bytes to 16-bit halfwords with ushll instructions and
then takes the absolute differences of the halfwords.

    sxtw    x8, w3
    ldr     q4, [x0]
    ldr     q5, [x2]
    add     x10, x0, x9
    add     x11, x2, x8
    ldr     q0, [x10]
    ldr     q1, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ldr     q2, [x10]
    ldr     q3, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ushll   v7.8h, v4.8b, #0
    ushll2  v16.8h, v4.16b, #0
    ushll   v17.8h, v5.8b, #0
    ushll2  v5.8h, v5.16b, #0
    ldr     q4, [x10]
    ldr     q6, [x11]
    uabdl   v18.4s, v16.4h, v5.4h
    uabdl   v19.4s, v7.4h, v17.4h
    uabdl2  v16.4s, v16.8h, v5.8h
    uabdl2  v17.4s, v7.8h, v17.8h

This is wider than necessary; the absolute-difference operation can be
performed on the bytes directly. Performing the operation directly on bytes
processes 8 lanes at a time instead of 4, as well as avoiding the unnecessary
shifts. This can be seen in the GCC codegen.

For gcc (trunk) -O3:

    sxtw    x2, w3
    sxtw    x3, w1
    add     x11, x5, x2
    ldr     q1, [x4]
    add     x9, x11, x2
    ldr     q5, [x5]
    add     x7, x9, x2
    ldr     q3, [x4, w1, sxtw]
    add     x4, x4, x3
    uabdl2  v2.8h, v1.16b, v5.16b
    add     x10, x4, x3
    ldr     q4, [x5, w2, sxtw]
    add     x8, x10, x3
    movi    v0.4s, 0
    add     x6, x8, x3
    uabal   v2.8h, v1.8b, v5.8b
    add     x5, x7, x2
    uabdl2  v1.8h, v3.16b, v4.16b
    add     x19, x5, x2
    ldr     q6, [x11, w2, sxtw]
    mov     x0, x2
    ldr     q5, [x4, w1, sxtw]
    add     x4, x6, x3
    uabal   v1.8h, v3.8b, v4.8b
    add     x30, x4, x3
    uadalp  v0.4s, v2.8h
    add     x18, x19, x2
    uabdl2  v2.8h, v5.16b, v6.16b

This affects SPEC2017 x264 performance.
Quuxplusone commented 4 years ago
From clang, we get the values in the inner loop extended to i32 (note that
this example is without the abs() call).

  %20 = load i8*, i8** %5, align 8
  %21 = load i32, i32* %11, align 4
  %22 = sext i32 %21 to i64
  %23 = getelementptr inbounds i8, i8* %20, i64 %22
  %24 = load i8, i8* %23, align 1
  %25 = zext i8 %24 to i32
  %26 = load i8*, i8** %7, align 8
  %27 = load i32, i32* %11, align 4
  %28 = sext i32 %27 to i64
  %29 = getelementptr inbounds i8, i8* %26, i64 %28
  %30 = load i8, i8* %29, align 1
  %31 = zext i8 %30 to i32
  %32 = sub nsw i32 %25, %31
  %33 = load i32, i32* %9, align 4
  %34 = add nsw i32 %33, %32
  store i32 %34, i32* %9, align 4
  br label %35

I think we could narrow the width of the extends in SLPVectorizer and extend
the result: extend the operands to i16, compute on i16, extend the result to
i32, then store. It seems like we have to drop the nsw flags though.

https://alive2.llvm.org/ce/z/bFjAHs
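
As an aside (my own illustration, not from the report), the narrowing
described above corresponds to scalar C along these lines, with the
intermediate difference kept in a 16-bit variable and only the final result
widened to 32 bits; the function name and signature are hypothetical:

    /* Sketch of the proposed narrowing (no abs(), as in the IR above): the
     * difference of two zero-extended bytes always fits in 16 bits, so it is
     * kept in a short and only sign-extended to int for the running sum. */
    int arrSumNarrowed(const unsigned char *a1, const unsigned char *a2,
                       int n, int sum) {
      for (int x = 0; x < n; x++) {
        short d = (short)((short)a1[x] - (short)a2[x]); /* range [-255, 255] */
        sum += (int)d; /* widen only the result to 32 bits */
      }
      return sum;
    }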
Quuxplusone commented 4 years ago

I think the entire sum in the original testcase fits into 16-bit arithmetic, since it's the sum of 256 8-bit numbers. Granted, it's a little tricky to detect that.
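
For concreteness (my own sketch, not part of the report): each absolute
difference is at most 255 and there are 16 * 16 = 256 of them, so the total is
at most 256 * 255 = 65280 < 65536, i.e. an unsigned 16-bit accumulator can
never overflow. A hand-narrowed variant, for illustration only:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical hand-narrowed version of the testcase: the whole reduction
     * is kept in a 16-bit accumulator, which cannot overflow because the
     * maximum possible sum is 256 * 255 = 65280. */
    int arrSum16(const unsigned char *a1, int inc_a1,
                 const unsigned char *a2, int inc_a2) {
      uint16_t sum = 0;
      for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
          sum += (uint16_t)abs(a1[x] - a2[x]);
        a1 += inc_a1;
        a2 += inc_a2;
      }
      return sum; /* widen the 16-bit total back to int only at the end */
    }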

Quuxplusone commented 3 years ago
https://godbolt.org/z/3e9hY5GEa

On trunk we're now recognising the reductions (aarch64 is unrolling the outer
loop; x86_64 is not, but the inner loop is where the fun is):

  %12 = bitcast i8* %10 to <16 x i8>*
  %13 = load <16 x i8>, <16 x i8>* %12, align 1
  %14 = zext <16 x i8> %13 to <16 x i32>
  %15 = bitcast i8* %11 to <16 x i8>*
  %16 = load <16 x i8>, <16 x i8>* %15, align 1
  %17 = zext <16 x i8> %16 to <16 x i32>
  %18 = sub nsw <16 x i32> %14, %17
  %19 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %18, i1 true)
  %20 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %19)

I think that SAD pattern is clear enough now that we could get value tracking
to recognise the upper bits of the reduction result aren't critical, and reduce
the width we extend to.

So I guess the next step is to add llvm.vector.reduce.add support to the
computeKnownBits/computeNumSignBits implementations.
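
As a rough sketch of the bound such a computeKnownBits extension could use (my
assumption, not an actual patch): if every lane feeding the
llvm.vector.reduce.add has at least Z known leading zero bits, summing N lanes
can carry into at most ceil(log2(N)) extra bits, so Z - ceil(log2(N)) leading
zeros remain known in the result. In C terms, with hypothetical names:

    #include <assert.h>

    /* If each lane of a vector.reduce.add input has at least elem_lz known
     * leading zero bits, the sum of num_elems lanes keeps at least
     * elem_lz - ceil(log2(num_elems)) of them. */
    unsigned reduce_add_known_leading_zeros(unsigned elem_lz,
                                            unsigned num_elems) {
      unsigned extra = 0;
      while ((1u << extra) < num_elems) /* extra = ceil(log2(num_elems)) */
        extra++;
      return elem_lz > extra ? elem_lz - extra : 0;
    }

    int main(void) {
      /* In the testcase each lane is the abs-difference of zext'd u8 values,
       * so each i32 lane is < 256 (24 known leading zeros); 16 lanes leave at
       * least 20 known zeros, i.e. the reduction fits in 12 bits and a
       * 16-bit extend would be wide enough. */
      assert(reduce_add_known_leading_zeros(24, 16) == 20);
      return 0;
    }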