Lops with memcpy suffer from poor LSR treatment on SystemZ.


Bugzilla Link	PR33225
Status	NEW
Importance	P enhancement
Reported by	Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on	2017-05-30 07:37:10 -0700
Last modified on	2017-05-30 07:37:53 -0700
Version	trunk
Hardware	PC Linux
CC	hfinkel@anl.gov, llvm-bugs@lists.llvm.org, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments	`tc_mvcs.ll` (9896 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 18538
reduced testcase

This is the same issue as with vector load / store: only a 12 bit displacement
is supported with MVC, which implements the memcpy.

I tried to extend LSR to treat this the same way as with a Load or Store, but
that does not work, since it seems that a memcpy Fixup does not get any Offset,
but it is rather the Formula (of type Basic) which has an UnfoldedOffset.

I tried then with

@@ -1290,6 +1290,11 @@ void Cost::RateFormula(const TargetTransformInfo &TTI,
     if ((isa<LoadInst>(Fixup.UserInst) || isa<StoreInst>(Fixup.UserInst)) &&
         !TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
       NumBaseAdds++;
+
+    if (Offset == 0 && F.UnfoldedOffset != 0 &&
+        isa<MemCpyInst>(Fixup.UserInst) &&
+        !TTI.isFoldableMemAccessOffset(Fixup.UserInst, F.UnfoldedOffset))
+      NumBaseAdds++;

, but this did only handle a very few cases. It did add the right cost, but it
seemed that there was no other formula to be rated higher.

It turns out that even though this was a fairly small single-block loop
containing basically 8 memcpy calls, the EstimateSearchSpaceComplexity() still
return a too high value, so the formulas with the foldable offsets
unfortunately got pruned.

The loop becomes:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r1)
    mvc    0(48,%r1), 0(%r3)
    lay    %r3, -288(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -240(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -192(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -144(%r2)
    mvc    0(48,%r3), 0(%r1)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r2), 0
    lay    %r1, 8192(%r1)
    la    %r2, 384(%r2)
    j    .LBB0_2

with another heuristic for narrowing the search space (-lsr-exp-narrow), it
seems the better formula is still there:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r2)
    mvc    0(48,%r1), 0(%r3)
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0(%r2)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r1), 0
    lay    %r2, 8192(%r2)
    la    %r1, 384(%r1)
    j    .LBB0_2

This actually works even without my little patch per above. Comparing -lsr-exp-
narrow across SPEC, it increases the total number of load-adress instructions
(unfolded offsets), with or without my patch, so it does not seem good to just
switch generally.

Is there any possibility of squeezing in the offset heuristic somewhere in this
pruning process? Is there anything else here that I missed?

Run reduced test case with

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll -lsr-exp-narrow

Quuxplusone / LLVMBugzillaTest

Lops with memcpy suffer from poor LSR treatment on SystemZ. #32197