Open Quuxplusone opened 6 years ago
Attached tc_instcombine.ll
(885 bytes, text/x-matlab): reduced testcase
If there was reason to believe that "shuffle [0 2 1 3]" is no more difficult
than "shuffle [1 0 3 2]" on any target, then it would be possible.
But barring that, I'm not seeing how to do this in a target-independent way in
instcombine. As noted, we're not allowed to generate shuffle masks that are not
obviously lower-able on any target.
So yes, a DAGCombine should do it. It would be nice if we could make these semi-
easy common cases available in DAGCombiner using a target-hook to enable them,
but even that is probably hard to determine.
x86 gets this with -mattr=avx, so you should be able to replicate whatever it
is doing here:
define <4 x i64> @shufsimp(<4 x i64> %x) {
%tmp7 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 0, i32 2>
%tmp8 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 1, i32 3>
%tmp9 = shufflevector <2 x i64> %tmp8, <2 x i64> %tmp7, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%tmp10 = shufflevector <4 x i64> %tmp9, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
ret <4 x i64> %tmp10
}
$ llc -o - shuf.ll -mattr=avx
...
vpermilpd $5, %ymm0, %ymm0 ## ymm0 = ymm0[1,0,3,2]
Another possibility: if the vectorizers are creating these sub-optimal sequences, can they be enhanced to squash them as part of their job? It used cost models to produce the shuffles in the 1st place, so it can't be much of a stretch to use those same models to simplify chains of shuffles? Maybe part of SLP's responsibility?
Thanks for the feedback!
I'm not seeing how to do this in a target-independent way in instcombine. As noted, we're not allowed to generate shuffle masks that are not obviously lower-able on any target.
SystemZ has a vector permute instruction that can take two (or one) input vectors and make any desired permutation of the elements. Therefore shuffles have the same cost regardless of the indices, as long as all the indices fit inside two vectors. If this is true for other targets also, we could add a method hasVectorPermute() or so and if that method returns yes, then allow InstCombine to choose indices freely. Or is this more complex than that?
Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask target and compare costs for different shuffles?
It seems cleaner to have InstCombine do this (since it is already claiming to do so) rather than having the last-to-run vectorizer (SLP) clean up for earlier vectorizations, or?
Is there a difference between shuffles produced by vectorizers and those in the input program, in regards to if Instcombine should preserve them? If that would be a general rule, then it seems each vectorizer would have to generate shuffles optimally since InstCombine can't improve them.
(In reply to Jonas Paulsson from comment #3)
> Thanks for the feedback!
>
> > I'm not seeing how to do this in a target-independent way in instcombine.
As noted, we're not allowed to generate shuffle masks that are not obviously
lower-able on any target.
>
> SystemZ has a vector permute instruction that can take two (or one) input
> vectors and make any desired permutation of the elements. Therefore shuffles
> have the same cost regardless of the indices, as long as all the indices fit
> inside two vectors. If this is true for other targets also, we could add a
> method hasVectorPermute() or so and if that method returns yes, then allow
> InstCombine to choose indices freely. Or is this more complex than that?
>
> Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask
> target and compare costs for different shuffles?
>
> It seems cleaner to have InstCombine do this (since it is already claiming
> to do so) rather than having the last-to-run vectorizer (SLP) clean up for
> earlier vectorizations, or?
>
> Is there a difference between shuffles produced by vectorizers and those in
> the input program, in regards to if Instcombine should preserve them? If
> that would be a general rule, then it seems each vectorizer would have to
> generate shuffles optimally since InstCombine can't improve them.
No, instcombine has no way to determine how the instruction was created AFAIK
(metadata?), so that's why it is taking the safe/conservative path.
These questions lead back to llvm-dev discussions about the role of
instcombine. Instcombine is not the place for target-specific optimizations
(although it does have them for target-specific intrinsics, I think that's
considered a mistake).
The general argument is that everyone wants to add functionality to instcombine
because it's the easiest solution, but it's too costly in compile-time to do
that. That's because instcombine is not particularly efficient, and it appears
many (8?) times in the typical -O2 optimization pipeline.
I understand that this is extra muddy for shufflevector because we bypass
obviously good IR transforms because we know the backend can't always recover.
For simpler ops, we can just add some reversal logic. See for example this pair
of patches:
https://reviews.llvm.org/D50992
https://reviews.llvm.org/D51553
But that may not be possible for shufflevector.
We could have a separate IR pass specifically for vector op (
http://llvm.org/docs/LangRef.html#vector-operations ) cleanup. It would look
similar to the existing instcombine for vector ops, but allow use of the cost
model, and only run after the vectorizers or in some other limited way. We
might also be able to move and improve
InstCombiner::SimplifyDemandedVectorElts() by creating this new pass. If that
sounds reasonable, it should be brought up for discussion on llvm-dev.
tc_instcombine.ll
(885 bytes, text/x-matlab)