[InstCombine] Fails to combine three shufflevectors produced by LoopVectorizer

Quuxplusone commented 6 years ago


Bugzilla Link	PR38792
Status	NEW
Importance	P enhancement
Reported by	Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on	2018-08-31 10:49:50 -0700
Last modified on	2018-09-03 09:10:27 -0700
Version	trunk
Hardware	PC Linux
CC	craig.topper@gmail.com, efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mssimpso@codeaurora.org, paulsson@linux.vnet.ibm.com, spatel+llvm@rotateright.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments	`tc_instcombine.ll` (885 bytes, text/x-matlab)
Blocks
Blocked by
See also

Created attachment 20815
reduced testcase

The LoopVectorizer has interleaved loads and stores. Basically, the loaded
elements should pairwise be reversed like

[0 1 2 3]  -> [1 0 3 2]

The Vectorizer does not understand this but generates from two interleave
groups first a result for the load group, and then makes another shuffle for
the store group

 %tmp6 = load <4 x i64>, <4 x i64>* %tmp5, align 8
  %tmp7 = shufflevector <4 x i64> %tmp6, <4 x i64> undef, <2 x i32> <i32 0, i32 2>
  %tmp8 = shufflevector <4 x i64> %tmp6, <4 x i64> undef, <2 x i32> <i32 1, i32 3>
  %tmp9 = shufflevector <2 x i64> %tmp8, <2 x i64> %tmp7, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %tmp10 = shufflevector <4 x i64> %tmp9, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  store <4 x i64> %tmp10, <4 x i64>* undef, align 8

This results in [1 0 3 2], and I would have hoped that this would become a
single shufflevector after instcombine, but this does not happen.

There are comments in InstCombine that this is purposely done very
conservatively. It is however clear that this does not give good code on
SystemZ.

I wonder if anyone has any idea if InstCombiner should handle this case, or if
not, where should this be done. A custom DAGCombine by the target?

bin/opt  ./tc_instcombine.ll -mtriple=systemz-unknown -mcpu=z13 -S -o
out.opt.ll -instcombine

Quuxplusone commented 6 years ago

Attached tc_instcombine.ll (885 bytes, text/x-matlab): reduced testcase

Quuxplusone commented 6 years ago

If there was reason to believe that "shuffle [0 2 1 3]" is no more difficult
than "shuffle [1 0 3 2]" on any target, then it would be possible.

But barring that, I'm not seeing how to do this in a target-independent way in
instcombine. As noted, we're not allowed to generate shuffle masks that are not
obviously lower-able on any target.

So yes, a DAGCombine should do it. It would be nice if we could make these semi-
easy common cases available in DAGCombiner using a target-hook to enable them,
but even that is probably hard to determine.

x86 gets this with -mattr=avx, so you should be able to replicate whatever it
is doing here:

define <4 x i64> @shufsimp(<4 x i64> %x) {
  %tmp7 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 0, i32 2>
  %tmp8 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 1, i32 3>
  %tmp9 = shufflevector <2 x i64> %tmp8, <2 x i64> %tmp7, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %tmp10 = shufflevector <4 x i64> %tmp9, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x i64> %tmp10
}

$ llc -o - shuf.ll -mattr=avx
...
    vpermilpd   $5, %ymm0, %ymm0 ## ymm0 = ymm0[1,0,3,2]

Quuxplusone commented 6 years ago

Another possibility: if the vectorizers are creating these sub-optimal sequences, can they be enhanced to squash them as part of their job? It used cost models to produce the shuffles in the 1st place, so it can't be much of a stretch to use those same models to simplify chains of shuffles? Maybe part of SLP's responsibility?

Quuxplusone commented 6 years ago

Thanks for the feedback!

I'm not seeing how to do this in a target-independent way in instcombine. As noted, we're not allowed to generate shuffle masks that are not obviously lower-able on any target.

SystemZ has a vector permute instruction that can take two (or one) input vectors and make any desired permutation of the elements. Therefore shuffles have the same cost regardless of the indices, as long as all the indices fit inside two vectors. If this is true for other targets also, we could add a method hasVectorPermute() or so and if that method returns yes, then allow InstCombine to choose indices freely. Or is this more complex than that?

Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask target and compare costs for different shuffles?

It seems cleaner to have InstCombine do this (since it is already claiming to do so) rather than having the last-to-run vectorizer (SLP) clean up for earlier vectorizations, or?

Is there a difference between shuffles produced by vectorizers and those in the input program, in regards to if Instcombine should preserve them? If that would be a general rule, then it seems each vectorizer would have to generate shuffles optimally since InstCombine can't improve them.

Quuxplusone commented 6 years ago

(In reply to Jonas Paulsson from comment #3)
> Thanks for the feedback!
>
> > I'm not seeing how to do this in a target-independent way in instcombine.
As noted, we're not allowed to generate shuffle masks that are not obviously
lower-able on any target.
>
> SystemZ has a vector permute instruction that can take two (or one) input
> vectors and make any desired permutation of the elements. Therefore shuffles
> have the same cost regardless of the indices, as long as all the indices fit
> inside two vectors. If this is true for other targets also, we could add a
> method hasVectorPermute() or so and if that method returns yes, then allow
> InstCombine to choose indices freely. Or is this more complex than that?
>
> Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask
> target and compare costs for different shuffles?
>
> It seems cleaner to have InstCombine do this (since it is already claiming
> to do so) rather than having the last-to-run vectorizer (SLP) clean up for
> earlier vectorizations, or?
>
> Is there a difference between shuffles produced by vectorizers and those in
> the input program, in regards to if Instcombine should preserve them? If
> that would be a general rule, then it seems each vectorizer would have to
> generate shuffles optimally since InstCombine can't improve them.

No, instcombine has no way to determine how the instruction was created AFAIK
(metadata?), so that's why it is taking the safe/conservative path.

These questions lead back to llvm-dev discussions about the role of
instcombine. Instcombine is not the place for target-specific optimizations
(although it does have them for target-specific intrinsics, I think that's
considered a mistake).

The general argument is that everyone wants to add functionality to instcombine
because it's the easiest solution, but it's too costly in compile-time to do
that. That's because instcombine is not particularly efficient, and it appears
many (8?) times in the typical -O2 optimization pipeline.

I understand that this is extra muddy for shufflevector because we bypass
obviously good IR transforms because we know the backend can't always recover.
For simpler ops, we can just add some reversal logic. See for example this pair
of patches:
https://reviews.llvm.org/D50992
https://reviews.llvm.org/D51553

But that may not be possible for shufflevector.

We could have a separate IR pass specifically for vector op (
http://llvm.org/docs/LangRef.html#vector-operations ) cleanup. It would look
similar to the existing instcombine for vector ops, but allow use of the cost
model, and only run after the vectorizers or in some other limited way. We
might also be able to move and improve
InstCombiner::SimplifyDemandedVectorElts() by creating this new pass. If that
sounds reasonable, it should be brought up for discussion on llvm-dev.

Quuxplusone / LLVMBugzillaTest

[InstCombine] Fails to combine three shufflevectors produced by LoopVectorizer #37765