Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Failure to remove before/after reverse shuffles from loop with decrementing iterator #48585

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR49616
Status NEW
Importance P enhancement
Reported by Simon Pilgrim (llvm-dev@redking.me.uk)
Reported on 2021-03-17 09:07:12 -0700
Last modified on 2021-03-17 11:32:26 -0700
Version trunk
Hardware PC Windows NT
CC anton.a.afanasyev@gmail.com, florian_hahn@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
https://godbolt.org/z/dWTd9e

void loop_dec(unsigned short *p, int max, int n) {
    for (int i = 0; i < (n & 31); i++) {
        unsigned m = *--p;
        *p = (unsigned short)(m >= max ? m-max : 0);
    }
}

By decrementing the pointer in the loop we end up with loop bodies like this:

52:
  %53 = phi i64 [ %41, %40 ], [ %64, %52 ]
  %54 = sub i64 0, %53
  %55 = getelementptr inbounds i16, i16* %51, i64 %54
  %56 = bitcast i16* %55 to <4 x i16>*
  %57 = load <4 x i16>, <4 x i16>* %56, align 2, !tbaa !2
  %58 = shufflevector <4 x i16> %57, <4 x i16> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %59 = zext <4 x i16> %58 to <4 x i32>
  %60 = call <4 x i32> @llvm.usub.sat.v4i32(<4 x i32> %59, <4 x i32> %50)
  %61 = trunc <4 x i32> %60 to <4 x i16>
  %62 = shufflevector <4 x i16> %61, <4 x i16> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %63 = bitcast i16* %55 to <4 x i16>*
  store <4 x i16> %62, <4 x i16>* %63, align 2, !tbaa !2
  %64 = add i64 %53, 4
  %65 = icmp eq i64 %64, %45
  br i1 %65, label %66, label %52, !llvm.loop !9

AFAICT we should be able to remove both these 'reverse' shufflevectors.

I'm not sure if we should be trying to fix this in InstCombine/VectorCombine or
inside the LoopVectorizer.
Quuxplusone commented 3 years ago
I think instcombine already has something to deal with this kind of problem,
perhaps it doesn't handle ext/trunc/saturating math?
Quuxplusone commented 3 years ago
https://godbolt.org/z/ePMEeT

We don't even fold:

define <4 x i32> @reverse_add_basic(<4 x i32> %a0, <4 x i32> %a1) {
  %r0 = shufflevector <4 x i32> %a0, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %ss = add <4 x i32> %r0, %a1
  %r1 = shufflevector <4 x i32> %ss, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i32> %r1
}

to:

define <4 x i32> @reverse_add_basic(<4 x i32> %a0, <4 x i32> %a1) {
  %r1 = shufflevector <4 x i32> %a1, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %ss = add <4 x i32> %a0, %r1
  ret <4 x i32> %ss
}

let alone anything that contains zext/trunc/intrinsics....

[Bug #46894] is sort of related to intrinsics handling