BDNZ not generated for simple loops

Quuxplusone commented 8 years ago


Bugzilla Link	PR28363
Status	REOPENED
Importance	P normal
Reported by	Ehsan Amiri (ehsanamiri@gmail.com)
Reported on	2016-06-29 13:16:16 -0700
Last modified on	2016-09-19 13:15:27 -0700
Version	trunk
Hardware	PC Windows NT
CC	hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com, sanjoy@playingwithpointers.com
Fixed by commit(s)
Attachments	`vec.ll` (9622 bytes, application/octet-stream) `novec.ll` (6439 bytes, application/octet-stream)
Blocks
Blocked by
See also	PR28429

For many loops, sometimes very simple ones like:

     for (i = 0; i < m; i++)
        sum += b[i]*a[i];

Where sum, a and b are single precision float (arrays), we do not generate
bdnz, and instead use, three instructions in the loop. (add, compare, branch-
conditional).

I looked at one of the loops like this and Scalar Evolution was not able to
compute loop iteration count. When the loop is not unrolled (compile with -fno-
unroll-loops) this is not an issue. Also for the residue loops of an unrolled
loop we generate bdnz.

Quuxplusone commented 8 years ago

In some cases, in addition to add and compare instruction, a mr instruction is
generated too, and we use two registers to keep track of induction variable.
The code pattern looks like this (intervening instructions removed):

mr      r22,r21
addi    r21,r22,4
cmplw   cr6,r12,r21
bne     cr6,10001120

Quuxplusone commented 8 years ago

Scalar Evolution makes it to ScalarEvolution::computeExitLimitFromICmp where it
creates SCEVs for LHS and RHS of the compare instruction. These expressions is
sound reasonable (loop is unrolled by a factor of 8) :

{8,+,8}<%for.body>
((-1 * (zext i3 (trunc i32 %m to i3) to i32))<nsw> + %m)

The subsequent analysis encounters a problem and concludes that the upper bound
cannot be computed. I have to look at the rest of the work in
computeExitLimitFromICmp  to see what goes wrong.

Quuxplusone commented 8 years ago

A couple of earlier passes seem to introduce some questionable "trunc"s in the
loop control logic which confuses SCEV.

The first one is ind var simplify. Sanjoy D suggsted the following code change.

 s/IsSigned == Cmp->isSigned()/IsSigned == Cmp->isSigned() || Cmp->isEquality()/

in WidenIV::widenLoopCompare.

The second one is in loop strength reduction.

Adding Sanjoy's suggested fix, and disabling LSR fixes the problem.

Quuxplusone commented 8 years ago

(In reply to comment #3)

Ignore this comment. I had problems reproducing it. It seems that the truncs generated during LFTR cause the problem.

Quuxplusone commented 8 years ago

fixed with
https://reviews.llvm.org/rL278421 (D23075)

Quuxplusone commented 8 years ago

The problem still exists for some simple loops such as this:

      for( i = 0 ; i < n ; i++ )
      {
         a[i] += b[i] + d[i];;
      }
(all array data types float)

again SCEV is unable to compute the iteration count of the loop: (from debug
output):

Exit Count for Loop at depth 1 containing: %vector.body<header><latch><exiting>
 from block vector.body: ***COULDNOTCOMPUTE***

Quuxplusone commented 8 years ago

(In reply to comment #6)
> The problem still exists for some simple loops such as this:
>
>       for( i = 0 ; i < n ; i++ )
>       {
>          a[i] += b[i] + d[i];;
>       }
> (all array data types float)
>
> again SCEV is unable to compute the iteration count of the loop: (from debug
> output):
>
> Exit Count for Loop at depth 1 containing:
> %vector.body<header><latch><exiting>
>  from block vector.body: ***COULDNOTCOMPUTE***

That looks like a vectorized loop, so it isn't too surprising that
SCEV could not compute the trip count here.  Have you tried looking at
debug output after disabling vectorization and unrolling?

Quuxplusone commented 8 years ago

> That looks like a vectorized loop, so it isn't too surprising that
> SCEV could not compute the trip count here.  Have you tried looking at
> debug output after disabling vectorization and unrolling?

Yes, if I disable both vectorization and unrolling, then bdnz is generated.
Disabling vectorization alone is not enough.

Do we know why SCEV cannot compute the iteration count when the loop is
unrolled/vectorized?

Quuxplusone commented 8 years ago

(In reply to comment #8)
> > That looks like a vectorized loop, so it isn't too surprising that
> > SCEV could not compute the trip count here.  Have you tried looking at
> > debug output after disabling vectorization and unrolling?
>
> Yes, if I disable both vectorization and unrolling, then bdnz is generated.
> Disabling vectorization alone is not enough.
>
> Do we know why SCEV cannot compute the iteration count when the loop is
> unrolled/vectorized?

I will have to take a look at the loop to give a concrete reason.  Can
you please extract out the loop with the missing trip count and either
put it on this bug or on its own bug?

Also, I'm not familiar with PPC; can you specify the relation between
BDNZ and the trip count?  Is BDNZ a loop instruction, like x86's REP?

Quuxplusone commented 8 years ago

> I will have to take a look at the loop to give a concrete reason.  Can
> you please extract out the loop with the missing trip count and either
> put it on this bug or on its own bug?
>
> Also, I'm not familiar with PPC; can you specify the relation between
> BDNZ and the trip count?  Is BDNZ a loop instruction, like x86's REP?

Thanks Sanjoy. I am not familiar with REP so here is a description of what BDNZ
is:

There is a register CTR on PPC that you can initialize before the loop. Then
each bdnz instruction will decrement that register and will branch if it is non-
zero. So we save a couple of instructions and use one less register. (The
instruction has some variants as well. I use bdnz to represent any proper insn
in this category).

So by extracting the loop, do you mean copying loops'IR just before the
transformation that generates BDNZ? If yes, I will put that in the next comment
(or send by email if it is too long, to avoid polluting this item).

Quuxplusone commented 8 years ago

(In reply to comment #10)
> So by extracting the loop, do you mean copying loops'IR just before the
> transformation that generates BDNZ? If yes, I will put that in the next
> comment (or send by email if it is too long, to avoid polluting this item).

Yes, that will do as a starting point.  Basically I'm looking for
whatever IR illustrates the problem you have with trip counts, without
involving the PPC backend.

Note: if the IR file is large, you can always attach it to the bug
instead of pasting it into a comment.  That will be better than a
one-off personal email.

Quuxplusone commented 8 years ago

I have to run now. Will attach the source code and IR later.(In reply to
comment #11)
> (In reply to comment #10)
> > So by extracting the loop, do you mean copying loops'IR just before the
> > transformation that generates BDNZ? If yes, I will put that in the next
> > comment (or send by email if it is too long, to avoid polluting this item).
>
> Yes, that will do as a starting point.  Basically I'm looking for
> whatever IR illustrates the problem you have with trip counts, without
> involving the PPC backend.
>
> Note: if the IR file is large, you can always attach it to the bug
> instead of pasting it into a comment.  That will be better than a
> one-off personal email.

Thanks. Yes, I have to run now. Will attach IR later.

Quuxplusone commented 8 years ago

Attached vec.ll (9622 bytes, application/octet-stream): vectorized code

Quuxplusone commented 8 years ago

Attached novec.ll (6439 bytes, application/octet-stream): compiled with -fno-vectorize

Quuxplusone commented 8 years ago

LSR creates an IV, and relies on SCEVExpander::expandIVInc to create IV bump insn. The calls to CreateAdd and CreateSub do not pass any value for nsw and nuw flags. So it defaults to false.

The same is true for loop vectorizer when it creates an IV in InnerLoopVectorizer::createInductionVariable.

When I force the calls to CreateAdd in these places to set nsw and nuw, to true, we generate BDNZ.

I don't know, if this is an oversight, or there is a good reason for avoiding setting this flag.

Quuxplusone / LLVMBugzillaTest

BDNZ not generated for simple loops #28363