
[LSR] LSR is producing bad code #31520

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR32548
Status NEW
Importance P enhancement
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2017-04-06 02:16:49 -0700
Last modified on 2017-06-07 08:45:29 -0700
Version trunk
Hardware PC Linux
CC atrick@apple.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments bad_lsr.ll (3678 bytes, text/plain)
out_lsr (12803 bytes, text/plain)
bad_lsr_2.ll (2535 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 18234
reduced test case

In the spirit of healthy competition between gcc and llvm, I have found a
regression on a benchmark. At least one of the problems is that LSR is not
producing good code (while gcc does).

On SystemZ, some memory instructions only support an immediate offset of 12
bits. This is the reason for the hook isFoldableMemAccessOffset(), which adds
extra cost if the offset does not fit.
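
For reference, here is a minimal sketch of what such a target hook could look like. It is an illustration only, not the actual SystemZ implementation: the helper name and the assumed 20-bit signed fallback displacement are not taken from the source.

    // Illustrative sketch of an isFoldableMemAccessOffset()-style check.
    #include "llvm/IR/Instructions.h"
    #include "llvm/Support/MathExtras.h"
    using namespace llvm;

    static bool isFoldableMemAccessOffsetSketch(Instruction *I, int64_t Offset) {
      // Non-memory users have nothing to fold into an addressing mode.
      if (!isa<LoadInst>(I) && !isa<StoreInst>(I))
        return true;
      // 12-bit unsigned displacement: always foldable.
      if (isUInt<12>(Offset))
        return true;
      // Assumed fallback: a 20-bit signed displacement form; anything beyond
      // this costs an extra instruction to materialize the offset.
      return isInt<20>(Offset);
    }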

This hook isn't enough to help this test case: LSR produces big offsets even
though extra cost is added for them. I tried adding experimentally large costs,
and even tried incrementing NumRegs in LSR when such an offset was present, but
this did not help. I suspect that either the right formula isn't generated in
the first place, or some other heuristic unfortunately takes precedence.

The cost on SystemZ of an offset that is too big for a memory access is one
extra instruction to add the offset.

I would appreciate any help on this,

Jonas

Run with:
bin/llc -mtriple=s390x-linux-gnu -mcpu=z13
Quuxplusone commented 7 years ago

Attached bad_lsr.ll (3678 bytes, text/plain): reduced test case

Quuxplusone commented 7 years ago

Attached out_lsr (12803 bytes, text/plain): LSR debug output

Quuxplusone commented 7 years ago

Attached bad_lsr_2.ll (2535 bytes, text/plain): An even smaller test case with 3+1 stores that LSR fails with

Quuxplusone commented 7 years ago

I have experimented for a while with this now, and would like some advice regarding legality checks:

Ideally, it would be nice to split a PHI node and make a new PHI node with a different starting offset from the base pointer, while using the same IV increment (step). In this way, groups of memory accesses could be formed so that each of them can fold its immediate offset.
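
At the source level, the rewrite being asked for is roughly the one below (hypothetical C++ loop; the stride of 20 doubles and the two large start offsets are scaled from the 160-byte stride and byte offsets in the debug output, while the small per-access deltas are made up for illustration):

    // Before: one induction pointer, so several accesses need out-of-range deltas.
    void before(double *dstGrid, int n) {
      for (int i = 0; i < n; ++i) {
        double *iv = dstGrid + 20 * i;  // single IV
        iv[-201986] = 1.0;              // byte offset ~ -1615888: not foldable
        iv[-201984] = 2.0;
        iv[-201982] = 3.0;
        iv[200035]  = 4.0;              // byte offset ~ +1600280: not foldable
      }
    }

    // After: two induction pointers with the same step but different starting
    // offsets, so every access uses a small, foldable displacement.
    void after(double *dstGrid, int n) {
      double *ivNeg = dstGrid - 201986;  // set up once in the preheader
      double *ivPos = dstGrid + 200035;
      for (int i = 0; i < n; ++i, ivNeg += 20, ivPos += 20) {
        ivNeg[0] = 1.0;
        ivNeg[2] = 2.0;
        ivNeg[4] = 3.0;
        ivPos[0] = 4.0;
      }
    }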

So, I wonder exactly what checks I must do before making a new IV for a memory access? I don't see LSR calling hasNoSelfWrap(), although this is what I had expected while processing SCEVs for the IVUsers.

In my simple experiment, I take cases of a loop-invariant base pointer, add/subtract an immediate offset in the preheader, and then add immediate offsets to that IV for the dependent memory accesses. I am not sure how the overflow issue should be checked; if, say, a C program used an unsigned index variable, its wrapping behaviour must be preserved, right?
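
As a rough sketch of the kind of legality check in question (only a reading of the SCEV API, not something LSR does today), one could require the address recurrence to be affine with a constant step and some no-wrap guarantee before introducing a rebased copy of the IV:

    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/Analysis/ScalarEvolutionExpressions.h"
    using namespace llvm;

    // Sketch: is it safe, wrap-wise, to rewrite {Start,+,Step} as
    // {Start + Delta,+,Step} and fold the difference into the memory accesses?
    static bool canRebaseAddRecSketch(const SCEV *S, ScalarEvolution &SE) {
      const auto *AR = dyn_cast<SCEVAddRecExpr>(S);
      if (!AR || !AR->isAffine())
        return false;                                   // only {Start,+,Step}
      if (!isa<SCEVConstant>(AR->getStepRecurrence(SE)))
        return false;                                   // want a constant step
      // Be conservative: without some no-wrap guarantee, the rebased IV could
      // wrap differently from the original one.
      return AR->hasNoSelfWrap() || AR->hasNoUnsignedWrap() ||
             AR->hasNoSignedWrap();
    }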

Thanks for any advice,

Jonas

Quuxplusone commented 7 years ago
Make sure LSR is not "chaining" the stores before solving the formulae. I don't
think this is your problem, because I don't see "Final Chain:" in the debug
output. I just wanted to make you aware of it, because I notice that this code
to check profitable chains does not check whether the constant is foldable, and
probably should:

    // Incrementing by zero or some constant is neutral. We assume constants can
    // be folded into an addressing mode or an add's immediate operand.
    if (isa<SCEVConstant>(Inc.IncExpr)) {
      ++NumConstIncrements;
      continue;
    }

I don't know why you think LSR won't create phis. Looking at the debug output I
see that different uses have possible solutions with different recurrences:

  LSR Use: Kind=Address of double in addrspace(0), Offsets={0}, widest fixup type: double*
    reg({(1600280 + %dstGrid)<nsw>,+,160}<nsw><%for.body>)
  LSR Use: Kind=Address of double in addrspace(0), Offsets={0}, widest fixup type: double*
    reg({(-1615888 + %dstGrid)<nsw>,+,160}<nsw><%for.body>)
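
(For readers less used to the notation: {Start,+,Step}<%for.body> is the value Start + i*Step on iteration i of %for.body, so the two registers above describe the same 160-byte stride from two different starting offsets. A trivial helper, named here only for illustration:)

    #include <cstdint>

    // Evaluate an affine add-recurrence {Start,+,Step} at iteration I; the two
    // registers above are (%dstGrid + 1600280) + 160*I and
    // (%dstGrid - 1615888) + 160*I.
    static int64_t addRecAt(int64_t Start, int64_t Step, int64_t I) {
      return Start + Step * I;
    }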

That looks like what you want from your description. So why isn't LSR choosing
those formulae? I know it likes to minimize "addrecs", but "79 imm cost" seems
extreme.
Quuxplusone commented 7 years ago

You will be able to tune your target cost model when https://reviews.llvm.org/D30561 lands.

Quuxplusone commented 7 years ago
> I notice that this code to check profitable chains does not check if the
> constant is foldable and probably should:
>
>     // Incrementing by zero or some constant is neutral. We assume constants can
>     // be folded into an addressing mode or an add's immediate operand.
>     if (isa<SCEVConstant>(Inc.IncExpr)) {
>       ++NumConstIncrements;
>       continue;
>     }

I tried your suggestion with this patch:

diff --git a/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index b027278..bc836be 100644
--- a/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -2742,9 +2742,14 @@ isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,
 
     // Incrementing by zero or some constant is neutral. We assume constants can
     // be folded into an addressing mode or an add's immediate operand.
-    if (isa<SCEVConstant>(Inc.IncExpr)) {
-      ++NumConstIncrements;
-      continue;
+    if (const SCEVConstant *IncConst = dyn_cast<SCEVConstant>(Inc.IncExpr)) {
+      if (IncConst != nullptr &&
+          ((!isa<LoadInst>(Inc.UserInst) && !isa<StoreInst>(Inc.UserInst)) ||
+          TTI.isFoldableMemAccessOffset(Inc.UserInst,
+                                        IncConst->getAPInt().getSExtValue()))) {
+        ++NumConstIncrements;
+        continue;
+      }
     }
 
     if (Inc.IncExpr == LastIncExpr)
@@ -2998,7 +3003,9 @@ static bool canFoldIVIncExpr(const SCEV *IncExpr, Instruction *UserInst,
   if (!IncConst || !isAddressUse(UserInst, Operand))
     return false;
 
-  if (IncConst->getAPInt().getMinSignedBits() > 64)
+  if (IncConst->getAPInt().getMinSignedBits() > 64 ||
+      !TTI.isFoldableMemAccessOffset(UserInst,
+                                     IncConst->getAPInt().getSExtValue()))
     return false;
 
   MemAccessTy AccessTy = getAccessType(UserInst);
@@ -3007,6 +3014,8 @@ static bool canFoldIVIncExpr(const SCEV *IncExpr, Instruction *UserInst,
                         IncOffset, /*HaseBaseReg=*/false))
     return false;

While some decisions changed regarding chains, I did not see any improvement in
the instruction counts of the innermost loops (overall they actually got
somewhat worse). Perhaps this just isn't good enough, since per the comment this
is something a smart scheduler ideally should handle. Right now, IIUC, each
offset is simply added relative to the previous address in the chain.

> I don't know why you think LSR won't create phis.

I meant that it will not make use of an extra address register: in a case with
widely varying offsets it will not try to use two PHIs when an extra register is
estimated to be available, but instead just picks the single best one. Maybe
this could be added as an extra check on the immediate offsets, so that the
found solution is split in those cases? But I have been told that gcc can
actually select a subset of the possible PHIs and so use multiple registers in
cases like this...

> That looks like what you want from your description. So why isn't LSR
> choosing those formulae?

In this test case, there are 3 fixups with huge negative offsets and 1 fixup
with a huge positive offset. It would have been best if LSR had chosen to offset
the IV by a negative amount, so that the group of 3 would lie closer to the IV.
What instead happens is that the IV starts with the positive offset, and then
there are 3 x (load iv; 64-bit subtract) in the loop :-(

The only difference between the formulas seems to be the "imm cost", and I am
not sure exactly what that is or how it relates to foldable immediate
offsets (?):

New best at 4 instructions 1 reg, with addrec cost 1, plus 3 base adds, plus 31 imm cost, plus 1 setup cost.
 Regs: {(-1615888 + %dstGrid)<nsw>,+,160}<nsw><%for.body>

New best at 4 instructions 1 reg, with addrec cost 1, plus 3 base adds, plus 1 setup cost.
 Regs: {(1600280 + %dstGrid)<nsw>,+,160}<nsw><%for.body>

(Again, using two PHIs instead of choosing just one of these would have been
ideal.)
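
To make the trade-off concrete, here is the back-of-the-envelope arithmetic, using only the two start offsets from the debug output above (the individual fixup offsets in bad_lsr_2.ll may differ slightly):

    #include <cstdint>
    #include <cstdio>

    int main() {
      const int64_t NegStart = -1615888; // start offset of the "negative" formula
      const int64_t PosStart = 1600280;  // start offset of the "positive" formula
      // Basing the IV at (%dstGrid + NegStart): the three negative-offset fixups
      // get small deltas, and only the positive one needs an out-of-range delta,
      // i.e. one extra add in the loop.
      std::printf("delta for the positive fixup: %lld\n",
                  static_cast<long long>(PosStart - NegStart)); // 3216168
      // Basing the IV at (%dstGrid + PosStart), as LSR currently does: all three
      // negative-offset fixups need that huge delta, i.e. three extra 64-bit
      // subtracts in the loop.
      std::printf("delta for each negative fixup: %lld\n",
                  static_cast<long long>(NegStart - PosStart)); // -3216168
      return 0;
    }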

> You will be able to tune your target cost model when
> https://reviews.llvm.org/D30561 lands.

Thanks - I guess I should start experimenting with this. It seems that the
immediate offsets are not given enough weight in the current cost function.
Quuxplusone commented 7 years ago

LSR should be able to choose a solution involving multiple phis across a set of uses. It probably won't be able to formulate a solution for a single use involving multiple phis if that's what you're after.