Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Lops with memcpy suffer from poor LSR treatment on SystemZ. #32197

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR33225
Status NEW
Importance P enhancement
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2017-05-30 07:37:10 -0700
Last modified on 2017-05-30 07:37:53 -0700
Version trunk
Hardware PC Linux
CC hfinkel@anl.gov, llvm-bugs@lists.llvm.org, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments tc_mvcs.ll (9896 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 18538
reduced testcase

This is the same issue as with vector load / store: only a 12 bit displacement
is supported with MVC, which implements the memcpy.

I tried to extend LSR to treat this the same way as with a Load or Store, but
that does not work, since it seems that a memcpy Fixup does not get any Offset,
but it is rather the Formula (of type Basic) which has an UnfoldedOffset.

I tried then with

@@ -1290,6 +1290,11 @@ void Cost::RateFormula(const TargetTransformInfo &TTI,
     if ((isa<LoadInst>(Fixup.UserInst) || isa<StoreInst>(Fixup.UserInst)) &&
         !TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
       NumBaseAdds++;
+
+    if (Offset == 0 && F.UnfoldedOffset != 0 &&
+        isa<MemCpyInst>(Fixup.UserInst) &&
+        !TTI.isFoldableMemAccessOffset(Fixup.UserInst, F.UnfoldedOffset))
+      NumBaseAdds++;

, but this did only handle a very few cases. It did add the right cost, but it
seemed that there was no other formula to be rated higher.

It turns out that even though this was a fairly small single-block loop
containing basically 8 memcpy calls, the EstimateSearchSpaceComplexity() still
return a too high value, so the formulas with the foldable offsets
unfortunately got pruned.

The loop becomes:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r1)
    mvc    0(48,%r1), 0(%r3)
    lay    %r3, -288(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -240(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -192(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -144(%r2)
    mvc    0(48,%r3), 0(%r1)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r2), 0
    lay    %r1, 8192(%r1)
    la    %r2, 384(%r2)
    j    .LBB0_2

with another heuristic for narrowing the search space (-lsr-exp-narrow), it
seems the better formula is still there:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r2)
    mvc    0(48,%r1), 0(%r3)
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0(%r2)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r1), 0
    lay    %r2, 8192(%r2)
    la    %r1, 384(%r1)
    j    .LBB0_2

This actually works even without my little patch per above. Comparing -lsr-exp-
narrow across SPEC, it increases the total number of load-adress instructions
(unfolded offsets), with or without my patch, so it does not seem good to just
switch generally.

Is there any possibility of squeezing in the offset heuristic somewhere in this
pruning process? Is there anything else here that I missed?

Run reduced test case with

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll -lsr-exp-narrow
Quuxplusone commented 7 years ago

Attached tc_mvcs.ll (9896 bytes, text/plain): reduced testcase