Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

vector ZERO_EXTEND gets scalarized #46101

Closed Quuxplusone closed 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47132
Status RESOLVED FIXED
Importance P enhancement
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2020-08-12 04:14:14 -0700
Last modified on 2020-09-08 08:00:41 -0700
Version trunk
Hardware PC Linux
CC efriedma@quicinc.com, llvm-bugs@lists.llvm.org, paulsson@linux.vnet.ibm.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments tc_or_i1.ll (261 bytes, text/plain)
tc_widenvecres_convert_FAIL.ll (552 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 23844
test case

The (first) DAGCombiner will replace a DAG like:

       t8: v2f64 = BUILD_VECTOR ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>
     t10: v2i1 = setcc t2, t8, setoeq:ch
   t11: v2i32 = zero_extend t10

in visitZERO_EXTEND() to

          t8: v2f64 = BUILD_VECTOR ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>
        t16: v2i64 = setcc t2, t8, setoeq:ch
      t17: v2i32 = truncate t16
     t19: v2i32 = BUILD_VECTOR Constant:i32<1>, Constant:i32<1>
   t20: v2i32 = and t17, t19

This works only with a SETCC operand (DAGCombiner.cpp:10482), so an OR of two
SETCC results is currently unhandled, and the type legalizer will then
scalarize the zero_extend:

t11: v2i32 = zero_extend t10
 WidenVectorResult()
   WidenVecRes_Convert() -> scalarization

Zero extending a vector of i1 is not quite simple, as it involves first
truncating / extending the vector select result, and then and:ing each element
with 1. Perhaps it would a custom DAG combine (pre-legalize types) is the right
approach given this? Or coudl the common-code that currently just handles a
SETCC operand be extended to handle this type of case, maybe?

The DAGTypeLegalizer wants to either widen the InVT (v2i1) or have the
InWidenVT to be legal (v4i1), or it will scalarize. I am not sure if that could
be made to work, but at first glance it doesn't look right.

llc -O3 -mcpu=z14 -o - ./tc_or_i1.ll -debug-only="isel,legalize-types"
Quuxplusone commented 4 years ago

Attached tc_or_i1.ll (261 bytes, text/plain): test case

Quuxplusone commented 4 years ago
Maybe we could try to do something in WidenVecRes_Convert?  It's not clear
what, exactly, we would do.  An arbitrary v2i1 will promote to v2i64; I guess
we could perform that promotion in WidenVecRes_Convert?  Something like the
following; we still end up with a BUILD_VECTOR, but it seems to optimize more
effectively.

diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 756a210..6d8833f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -3290,6 +3290,18 @@ SDValue DAGTypeLegalizer::WidenVecRes_Convert(SDNode *N)
{
   unsigned Opcode = N->getOpcode();
   unsigned InVTNumElts = InVT.getVectorNumElements();
   const SDNodeFlags Flags = N->getFlags();
+
+  if (N->getOpcode() == ISD::ZERO_EXTEND &&
+      getTypeAction(InVT) == TargetLowering::TypePromoteInteger) {
+    InOp = ZExtPromotedInteger(InOp);
+    InVT = InOp.getValueType();
+    InEltVT = InVT.getVectorElementType();
+    InWidenVT = EVT::getVectorVT(*DAG.getContext(), InEltVT, WidenNumElts);
+    InVTNumElts = InVT.getVectorNumElements();
+    if (WidenVT.getScalarSizeInBits() < InVT.getScalarSizeInBits())
+      Opcode = ISD::TRUNCATE;
+  }
+
   if (getTypeAction(InVT) == TargetLowering::TypeWidenVector) {
     InOp = GetWidenedVector(N->getOperand(0));
     InVT = InOp.getValueType();
@@ -3341,6 +3353,8 @@ SDValue DAGTypeLegalizer::WidenVecRes_Convert(SDNode *N) {

   // Otherwise unroll into some nasty scalar code and rebuild the vector.
   EVT EltVT = WidenVT.getVectorElementType();
+  if (WidenVT.getScalarSizeInBits() < InVT.getScalarSizeInBits())
+    EltVT = InVT.getVectorElementType();
   SmallVector<SDValue, 16> Ops(WidenNumElts, DAG.getUNDEF(EltVT));
   // Use the original element count so we don't do more scalar opts than
   // necessary.
Quuxplusone commented 4 years ago
Hmm.  Looks like the main problem is that the zero_extend node involves two
types that prefer different legalization strategies:

v2i1 would prefer promotion to v2i64
v2i32 would prefer widening to v4i32

However, the main loop in DAGTypeLegalizer::run initially only uses the
*result* type to decide a strategy, which implies widening here.

What we really want here is to use the promoted v2i64 input, and transform it
into a v4i32 output, which here means to perform a widened truncate.

And this is in fact pretty much what the proposed patch does, so that looks
good to me.  It is true that we still fall down into the BUILD_VECTOR case.  We
avoid this for the extension cases by making use of the *_EXTEND_VECTOR_INREG
opcodes.  I guess we could do the same for truncation if we had an analogous
TRUNCATE_VECTOR_INREG operation.

On the other hand, there's probably not much semantic difference between a
hypothetical TRUNCATE_VECTOR_INREG and open-coding the same via a BUILD_VECTOR
of TRUNCATEs, which may be why those optimize well.

Jonas, how does your original test case perform with Eli's patch?
Quuxplusone commented 4 years ago

Attached tc_widenvecres_convert_FAIL.ll (552 bytes, text/plain): reduced testcase (for Elis patch)

Quuxplusone commented 4 years ago

I hacked it together quickly as a proof of concept, so I'm not completely surprised I broke something.

Looking at the patch again, I think the second hunk is wrong; we shouldn't need to mess with the type of the BUILD_VECTOR operands. I think it was avoiding a crash somehow, but probably I didn't diagnose the root cause correctly.

Quuxplusone commented 4 years ago
(In reply to Eli Friedman from comment #4)
> I hacked it together quickly as a proof of concept, so I'm not completely
> surprised I broke something.
>
> Looking at the patch again, I think the second hunk is wrong; we shouldn't
> need to mess with the type of the BUILD_VECTOR operands.  I think it was
> avoiding a crash somehow, but probably I didn't diagnose the root cause
> correctly.

My test case didn't seem to produce a bad DAG with that second hunk, which made
the truncate become a no-op which is still correct since the BUILD_VECTOR
performs an implicit truncation. However, without the second hunk the truncate
node is created, which causes the DAG combiner to behave differently and the
test case no longer fails.

It makes sense to me to remove the second hunk I think, but I have not seen any
crash without it like you mentioned... I tried to make a test case that had a
wider WidenVT (comparing v8i16 and zero extending the i1 vector to v2i64, but
that ended up to be handled by splitting, so it didn't seem to reach here.

I could now build SPEC (17) and it seems that there are some 25 cases where
this triggers, which is kind of what I was expecting:

nilf           :                10770                10723      -47
vlgvf          :                 7385                 7338      -47
vn             :                 2091                 2116      +25
vpkg           :                  914                  939      +25
vlvgf          :                 2698                 2673      -25
vlvgp          :                 9634                 9611      -23
vrepig         :                 1503                 1513      +10
...

The SystemZ/CodeGen tests are also passing.
Quuxplusone commented 4 years ago
Oh, maybe (In reply to Jonas Paulsson from comment #5)
> (In reply to Eli Friedman from comment #4)
> > I hacked it together quickly as a proof of concept, so I'm not completely
> > surprised I broke something.
> >
> > Looking at the patch again, I think the second hunk is wrong; we shouldn't
> > need to mess with the type of the BUILD_VECTOR operands.  I think it was
> > avoiding a crash somehow, but probably I didn't diagnose the root cause
> > correctly.
>
> My test case didn't seem to produce a bad DAG with that second hunk, which
> made the truncate become a no-op which is still correct since the
> BUILD_VECTOR performs an implicit truncation. However, without the second
> hunk the truncate node is created, which causes the DAG combiner to behave
> differently and the test case no longer fails.

You could argue there's a bug in the SystemZ backend where it's incorrectly
marking nodes legal, but really, there isn't any reason for us to generate
insert_vector_elt etc. with overwide scalar types.

> It makes sense to me to remove the second hunk I think, but I have not seen
> any crash without it like you mentioned... I tried to make a test case that
> had a wider WidenVT (comparing v8i16 and zero extending the i1 vector to
> v2i64, but that ended up to be handled by splitting, so it didn't seem to
> reach here.

Okay, that seems fine.
Quuxplusone commented 4 years ago
> You could argue there's a bug in the SystemZ backend where it's incorrectly
marking nodes legal, but really, there isn't any reason for us to generate
insert_vector_elt etc. with overwide scalar types.

The insert_vector_elt that fails during isel is:

   t111: v4i32 = BUILD_VECTOR undef:i64, Constant:i64<0>, Constant:i64<0>, undef:i64
   t78: i64 = sign_extend t100
 t112: v4i32 = insert_vector_elt t111, t78, Constant:i32<0>

I thought this was a missing pattern for a truncating insert_vector_elt. But I
suppose this is not worth more attention now since we got rid of that problem...

Anyway.. with your patch we now avoid element extraction which is expensive so
this is a very nice improvement. However, since the SETCC of v2i32 is first
promoted to v2i64, and then truncated back to v2i32, I see a room for further
improvement:

trunk:

fun:                                    # @fun
        .cfi_startproc
# %bb.0:
        vgbm    %v0, 0
        vceqf   %v1, %v24, %v0
        vceqf   %v0, %v26, %v0
        vuphf   %v1, %v1
        vuphf   %v0, %v0
        vo      %v0, %v1, %v0
        vlgvf   %r1, %v0, 3
        vlgvf   %r0, %v0, 1
        nilf    %r1, 1
        vlvgp   %v24, %r1, %r1
        nilf    %r0, 1
        vlvgf   %v24, %r0, 0
        br      %r14

With your patch:

fun:                                    # @fun
        .cfi_startproc
# %bb.0:
        vgbm    %v0, 0
        vceqf   %v1, %v24, %v0
        vceqf   %v0, %v26, %v0
        vmrhf   %v1, %v1, %v1     // <-
        vmrhf   %v0, %v0, %v0     // <-
        vo      %v0, %v1, %v0
        vrepig  %v1, 1
        vn      %v0, %v0, %v1
        vpkg    %v24, %v0, %v0     // <-
        br      %r14

The vector merge-high and vector pack instrctions should not be needed it seems
to me:

[ 0 1 2 3] = vceqf %v1
[ 4 5 6 7] = vceqf %v0
[ 0 0 1 1] = vmrhf %v1, %v1
[ 4 4 5 5] = vmrhf %v0, %v0
[ 8 9 a b] = vo; vn
[ 9 b 9 b] = vpgk %v0, %v0

Without the extension to v2i64 and truncation back to v4i32:

[ 0 1 2 3] = vceqf %v1
[ 4 5 6 7] = vceqf %v0
[ 9 b - -] = vo; vn

fun:                                    # @fun
        .cfi_startproc
# %bb.0:
        vgbm    %v0, 0
        vceqf   %v1, %v24, %v0
        vceqf   %v0, %v26, %v0
        vo      %v0, %v1, %v0
        vrepig  %v1, 1
        vn      %v0, %v0, %v1
        br      %r14

I am not sure how to best address this, and perhaps this further improvement
belongs in a separate patch... Right now I see that PromoteIntRes_SETCC()
handles the SETCC node by creating a new SETCC with the getSetCCResultType()
result type, and then doing SExtOrTrunc to TLI.getTypeToTransfomrTo(v2i1).

While in the general case I would think that the promotion of a vector of i1s
must follow the general model, it seems that in certain cases like this should
not have to. More specifically, if the result is only used by a (root)
ZERO_EXTEND the promotion should go directly to that result type.

What if the target could decide on the type to promote to? In this case the
SystemZ target could analyze the DAG and decide that it can promote to v2i32
(and also widen to v4i32?).

Alternatively, maybe this could be done as a target DAGCombine before
TypeLegalizer?
Quuxplusone commented 4 years ago

Hmmm... perhaps this promotion issue really just belongs to the test case at hand: with Elis patch I see no additional vector merge instructions on benchmarks...

Quuxplusone commented 4 years ago
>The insert_vector_elt that fails during isel is:
>
>   t111: v4i32 = BUILD_VECTOR undef:i64, Constant:i64<0>, Constant:i64<0>,
undef:i64
>   t78: i64 = sign_extend t100
> t112: v4i32 = insert_vector_elt t111, t78, Constant:i32<0>
>
>I thought this was a missing pattern for a truncating insert_vector_elt. But I
>suppose this is not worth more attention now since we got rid of that
problem...

Ah, I guess this is because the VLVG{B,H,F} patterns require the element to be
in  register class GR32?  It should be possible to fix this.

>I am not sure how to best address this, and perhaps this further improvement
>belongs in a separate patch...

Agreed.  At first glance, it looks like this something the back-end general
shuffle / pack logic should be able to handle.

However, I'm not sure exactly how this code is being generated.  Where does the
merge high come from in the first place?
Quuxplusone commented 4 years ago
>  Where does the merge high come from in the first place?

Sorry - this was the test case this time:

define <2 x i32> @fun(<2 x i32> %Arg1, <2 x i32> %Arg2) {
  %i3 = icmp eq <2 x i32> %Arg1, zeroinitializer
  %i5 = icmp eq <2 x i32> %Arg2, zeroinitializer
  %i6 = or <2 x i1> %i3, %i5
  %i7 = zext <2 x i1> %i6 to <2 x i32>
  ret <2 x i32> %i7
}

The <2 x i1> is first promoted to <2 x i64>, and then truncated/widened to <4 x
i32>:  The v2i1 SETCC is type-legalized to v4i32 SETCC + v2i64
SIGN_EXTEND_VECTOR_INREG.DAGCombiner then replaces the SIGN_EXTEND_VECTOR_INREG
with ANY_EXTEND_VECTOR_INREG. Vector-legalizer then replaces that with
VECTOR_SHUFFLE<u, 0, u 1>, and then that is lowered to SystemZISD::MERGE_HIGH.
So it's part of the promotion from v2i1 to v2i64 via v4i32.

Elis patch proposed at https://reviews.llvm.org/D86268.
Quuxplusone commented 4 years ago
(In reply to Jonas Paulsson from comment #10)
> The <2 x i1> is first promoted to <2 x i64>, and then truncated/widened to
> <4 x i32>:  The v2i1 SETCC is type-legalized to v4i32 SETCC + v2i64
> SIGN_EXTEND_VECTOR_INREG.DAGCombiner then replaces the
> SIGN_EXTEND_VECTOR_INREG with ANY_EXTEND_VECTOR_INREG. Vector-legalizer then
> replaces that with VECTOR_SHUFFLE<u, 0, u 1>, and then that is lowered to
> SystemZISD::MERGE_HIGH. So it's part of the promotion from v2i1 to v2i64 via
> v4i32.

So we end up with two SETCC + VECTOR_SHUFFLE pairs, followed by the OR,
followed by the truncation, right?  And the truncation is in fact still
implemented elementwise (so a BUILD_VECTOR of two scalar TRUNCATEs of
EXTRACT_VECTOR_ELT)?

That would make the earlier parts difficult to optimize in DAGCombine since
this usually only handles single-use values, and the OR has the two
EXTRACT_VECTOR_ELT uses ...

Maybe for such instances it would make sense to have something like a
TRUNCATE_VECTOR_INREG after all?  Then we would at some point see a sequence of
TRUNCATE_VECTOR_INREG (OR (EXTEND_VECTOR_INREG) (EXTEND_VECTOR_INREG)), which
could be optimized to a single OR in the inner mode.  The equivalent
transformation is already done for scalars, I think.

But that is certainly a separate issue.
Quuxplusone commented 4 years ago

6dc3e22