Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Terrible shuffle lowering for zip of two i8 values (all backends) #30274

Closed Quuxplusone closed 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR31301
Status RESOLVED FIXED
Importance P normal
Reported by Eli Friedman (efriedma@quicinc.com)
Reported on 2016-12-06 20:53:39 -0800
Last modified on 2016-12-15 16:42:06 -0800
Version trunk
Hardware PC Windows NT
CC llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, spatel+llvm@rotateright.com, zvirack@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
C testcase for ARM:

#include <arm_neon.h>
uint8x8_t f(char* x, char *y)
{
  return vzip_u8(vld1_dup_u8(x), vld1_dup_u8(y)).val[0];
}

IR testcase:

define <8 x i8> @vdup_zip(i8* nocapture readonly %x, i8* nocapture readonly %y)
{
entry:
  %0 = load i8, i8* %x, align 1
  %1 = insertelement <8 x i8> undef, i8 %0, i32 0
  %lane = shufflevector <8 x i8> %1, <8 x i8> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>
  %2 = load i8, i8* %y, align 1
  %3 = insertelement <8 x i8> undef, i8 %2, i32 0
  %lane3 = shufflevector <8 x i8> %3, <8 x i8> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>
  %vzip.i = shufflevector <8 x i8> %lane, <8 x i8> %lane3, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
  ret <8 x i8> %vzip.i
}

IR looks fine.  CodeGen gives:

        ldrb    r0, [r0]
        ldrb    r1, [r1]
        vmov.8  d16[0], r0
        vmov.8  d16[1], r1
        vmov.8  d16[2], r0
        vmov.8  d16[3], r1
        vmov.8  d16[4], r0
        vmov.8  d16[5], r1
        vmov.8  d16[6], r0
        vmov.8  d16[7], r1
        vmov    r0, r1, d16
        bx      lr

i.e. we've managed to blow up a simple three-instruction NEON sequence into ten
instructions.

Slight variant for testing on architectures which have `16 x i8`, not `8 x i8`:

define <16 x i8> @vdup_zip(i8* nocapture readonly %x, i8* nocapture readonly
%y)  {
entry:
  %0 = load i8, i8* %x, align 1
  %1 = insertelement <16 x i8> undef, i8 %0, i32 0
  %lane = shufflevector <16 x i8> %1, <16 x i8> undef, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %2 = load i8, i8* %y, align 1
  %3 = insertelement <16 x i8> undef, i8 %2, i32 0
  %lane3 = shufflevector <16 x i8> %3, <16 x i8> undef, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %vzip.i = shufflevector <16 x i8> %lane, <16 x i8> %lane3, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
  ret <16 x i8> %vzip.i
}

It looks like DAGCombine turns the IR into a BUILD_VECTOR, and the ARM backend
can't recover the shape.  Actually, it looks like every backend fails to
produce the obvious lowering; aarch64 generates a sequence of ins instructions,
x86 generates a bunch of vpinsrb instructions, systemz generates a sequence of
vlvgb.  powerpc manages to at least generate a shuffle, but it generates two
extra instructions because it doesn't manage to pick the right shuffle.

I'm not exactly sure what the right solution looks like here; maybe we can do
something more helpful on a target-independent level than just throwing away
the shuffles and creating a BUILD_VECTOR?
Quuxplusone commented 7 years ago

Zvi - is this due to rL285063?

Quuxplusone commented 7 years ago

I rolled back rL285063 and ran the top example with: llc -march=arm -mattr=+neon and got the same generated code as reported here.

Quuxplusone commented 7 years ago

The problem goes away if I disable combineShuffleOfScalars... or at least it improves to some extent. (There's a few smaller problems on ARM for the original testcase, but we do at least generate a shuffle.)

It looks like the combine was originally added in r234004 (https://reviews.llvm.org/D8516); I think this case is getting picked up by accident because each input scalar_to_vector is used "once".

Quuxplusone commented 7 years ago

Partially fixed by https://reviews.llvm.org/rL289874 ; we no longer generate element-by-element insertion, but we still replace a shuffle with another, less efficient shuffle.

Remaining fix under review in https://reviews.llvm.org/D27793 .

Quuxplusone commented 7 years ago

https://reviews.llvm.org/rL289882