LoopVectorizer generating bad and unhandled shufflevector sequence

Quuxplusone commented 8 years ago


Bugzilla Link	PR30630
Status	RESOLVED WORKSFORME
Importance	P enhancement
Reported by	Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on	2016-10-07 03:41:42 -0700
Last modified on	2017-04-19 21:43:24 -0700
Version	trunk
Hardware	All Linux
CC	efriedma@quicinc.com, Hao.Liu@arm.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mcrosier@codeaurora.org, mssimpso@codeaurora.org, paulsson@linux.vnet.ibm.com, spatel+llvm@rotateright.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments	`doop.bc` (76256 bytes, application/octet-stream) `do_vop.bc` (13680 bytes, application/octet-stream) `shuffles_tc.ll` (145586 bytes, text/plain)
Blocks
Blocked by
See also

Quuxplusone commented 8 years ago

Attached doop.bc (76256 bytes, application/octet-stream): test case

Quuxplusone commented 8 years ago

I have experimented with enabling the LoopVectorizer for SystemZ, and come
across a loop which, when vectorized, seems to have been poorly generated.
There seems to be a completely unnecessary sequence of shufflevector
instructions, that doesn't get optimized away anywhere. In other words, there
is a shuffling that leads back to the original vector:

       [0 1 2 3 4 5 6 7]

 [0 4]   [1 5]   [2 6]   [3 7]

   [0 4 1 5]       [2 6 3 7]

       [0 1 2 3 4 5 6 7]

This should be handled by opt passes after the loop vectorizer.

Run test case with:
bin/opt -mcpu=z13 -S -O3 ./do_vop.bc -unroll-max-count=0 -force-target-
instruction-cost=2 -enable-interleaved-mem-accesses -force-target-max-vector-
interleave=1

There seems to be three vectorized loops with this problem. (Test case
originally from SPEC 2006)

Quuxplusone commented 8 years ago

Attached do_vop.bc (13680 bytes, application/octet-stream): reduced test case

Quuxplusone commented 8 years ago

Thanks Jonas.

From your previous message, I think this will boil down to teaching
InstSimplify/InstCombine how to simply something like:

define void @test(<8 x i64>* %arg, <8 x i64>* %arg1, <8 x i64>* %arg2) {
bb:
  %tmp = load <8 x i64>, <8 x i64>* %arg, align 8
  %tmp3 = load <8 x i64>, <8 x i64>* %arg1, align 8
  %tmp4 = and <8 x i64> %tmp3, %tmp
  %tmp5 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
  %tmp6 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
  %tmp7 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
  %tmp8 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
  %tmp9 = shufflevector <2 x i64> %tmp5, <2 x i64> %tmp6, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %tmp10 = shufflevector <2 x i64> %tmp7, <2 x i64> %tmp8, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %tmp11 = shufflevector <4 x i64> %tmp9, <4 x i64> %tmp10, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
  store <8 x i64> %tmp11, <8 x i64>* %arg2, align 8
  ret void
}

into:

define void @test(<8 x i64>* %arg, <8 x i64>* %arg1, <8 x i64>* %arg2) {
bb:
  %tmp = load <8 x i64>, <8 x i64>* %arg, align 8
  %tmp3 = load <8 x i64>, <8 x i64>* %arg1, align 8
  %tmp4 = and <8 x i64> %tmp3, %tmp
  store <8 x i64> %tmp4, <8 x i64>* %arg2, align 8
  ret void
}

The transformation might look something like this: If we have a shuffle X,
whose operands are both shuffles from the same vector Y, we can replace X with
a shuffle from Y. If this is applied iteratively, it should simplify the above
case.

Quuxplusone commented 8 years ago

(In reply to comment #4)
> Thanks Jonas.
>
> From your previous message, I think this will boil down to teaching
> InstSimplify/InstCombine how to simply something like:
>
> define void @test(<8 x i64>* %arg, <8 x i64>* %arg1, <8 x i64>* %arg2) {
> bb:
>   %tmp = load <8 x i64>, <8 x i64>* %arg, align 8
>   %tmp3 = load <8 x i64>, <8 x i64>* %arg1, align 8
>   %tmp4 = and <8 x i64> %tmp3, %tmp
>   %tmp5 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 0,
> i32 4>
>   %tmp6 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 1,
> i32 5>
>   %tmp7 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 2,
> i32 6>
>   %tmp8 = shufflevector <8 x i64> %tmp4, <8 x i64> undef, <2 x i32> <i32 3,
> i32 7>
>   %tmp9 = shufflevector <2 x i64> %tmp5, <2 x i64> %tmp6, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %tmp10 = shufflevector <2 x i64> %tmp7, <2 x i64> %tmp8, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %tmp11 = shufflevector <4 x i64> %tmp9, <4 x i64> %tmp10, <8 x i32> <i32
> 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %tmp11, <8 x i64>* %arg2, align 8
>   ret void
> }
>
> into:
>
> define void @test(<8 x i64>* %arg, <8 x i64>* %arg1, <8 x i64>* %arg2) {
> bb:
>   %tmp = load <8 x i64>, <8 x i64>* %arg, align 8
>   %tmp3 = load <8 x i64>, <8 x i64>* %arg1, align 8
>   %tmp4 = and <8 x i64> %tmp3, %tmp
>   store <8 x i64> %tmp4, <8 x i64>* %arg2, align 8
>   ret void
> }
>
> The transformation might look something like this: If we have a shuffle X,
> whose operands are both shuffles from the same vector Y, we can replace X
> with a shuffle from Y. If this is applied iteratively, it should simplify
> the above case.

Hi MAtthew,

I tried your suggestion a bit inside the InstCombiner::
visitShufflerVectorInst() method:

...
   // case 3 or 4
   if (RHSShuffle && RHSOp0Width == LHSWidth) {
     newRHS = RHSOp0;
   }
+
   // case 4
-  if (LHSOp0 == RHSOp0) {
+  bool EqOp0s = (LHSOp0 == RHSOp0);
+  if (!EqOp0s && LHSOp0 != nullptr && RHSOp0 != nullptr) {
+    if (Instruction* LHSOp0Inst = dyn_cast<Instruction>(LHSOp0)) {
+      if (Instruction* RHSOp0Inst = dyn_cast<Instruction>(RHSOp0)) {
+        if (LHSOp0Inst->isIdenticalTo(RHSOp0Inst) &&
+            !LHSOp0Inst->mayHaveSideEffects() && !LHSOp0Inst-
>mayReadFromMemory() &&
+            !RHSOp0Inst->mayHaveSideEffects() && !RHSOp0Inst-
>mayReadFromMemory()) {
+          EqOp0s = true;
+        }
+      }
+    }
+  }
+  if (EqOp0s) {
     newLHS = LHSOp0;
     newRHS = nullptr;
   }

   if (newLHS == LHS && newRHS == RHS)
     return MadeChange ? &SVI : nullptr;
...

This may of course not be the best solution at all, this is just the point
where I came to a stop...

The first problem in my test case is that there are 4 identical AND
instructions, which are not handled anywhere until instruction selection. I
assume that this is because the loop vectorizer only works on one block and
therefore these redundant instructions should get always get handled by the DAG
machinery?
It would perhaps make sense to instead prune identical instructions locally
after the loop vectorizer in a simple pass (or is there a function to call
somewhere?), if this bug actually results in a patch.

The ANDs now get recognized as identical, which should mean that either one
could be considered as the input vector. It then continues under case 4.

Everything works fine, except that the conservativeness of this method refuses
to do the rewrite, because newMask != Mask. I suppose the reason for fearing
this comes from experience, so I must then think that in order to handle this,
one would have to do a recursive search for an identity mask in multiple levels.
I guess this might be true regardless of it being this special case of handling
identical instructions.

Is this making sense, and what are your thoughts?

/Jonas

Quuxplusone commented 8 years ago

It seems that it is quite beneficial for SystemZ (and other targets?) to optimize ShuffelVectorInstrs more than what is currently done. I think this probably should be done under control of a TTI flag.

Quuxplusone commented 7 years ago

InstSimplify patch posted for review:
https://reviews.llvm.org/D31960

Quuxplusone commented 7 years ago

Attached shuffles_tc.ll (145586 bytes, text/plain): unreduced test case (but small loop body)

Quuxplusone commented 7 years ago

%39 = shufflevector <4 x i32> %36, <4 x i32> undef, <12 x i32> <i32 0, i32 1,
i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  %interleaved.vec = shufflevector <12 x i32> %39, <12 x i32> undef, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>

So here we have:
  [0 1 2 3]

  [0 1 2 3 0 1 2 3 0 1 2 3]

  [0 0 0 1 1 1 2 2 2 3 3 3]

This would have to be an instcombine, and I don't think we can do the transform
without a cost model.

Quuxplusone commented 7 years ago

I am curious as to why it could ever be better to have two shufflevector instructions instead of just one (provided that there is no other use of the intermediate shuffle)?

Quuxplusone commented 7 years ago

(In reply to Jonas Paulsson from comment #10)
> I am curious as to why it could ever be better to have two shufflevector
> instructions instead of just one (provided that there is no other use of the
> intermediate shuffle)?

The problem is not the IR itself; it's that the backend can't lower arbitrary
shuffle masks effectively in all cases.

Eli showed an x86 example in https://reviews.llvm.org/D31509 where an arbitrary
shuffle mask for a single IR instruction ended up with worse codegen than
multiple shufflevector IR instructions.

Quuxplusone commented 7 years ago

Jonas - I think the original problem is solved after:
https://reviews.llvm.org/rL300714

Do you want to leave this report open to track other issues or open new reports
for those?

Quuxplusone commented 7 years ago

I beleive there is not an issue any more at least for the moment with shuffles on SystemZ :-)

Quuxplusone / LLVMBugzillaTest

LoopVectorizer generating bad and unhandled shufflevector sequence #29602