Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

loop vectorizer unable to reason with trip count bounds #39931

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR40961
Status NEW
Importance P enhancement
Reported by Kalev Soikonen (oiskuu@gmail.com)
Reported on 2019-03-04 14:18:31 -0800
Last modified on 2021-11-10 07:56:16 -0800
Version trunk
Hardware Other All
CC diegocaballero@google.com, efriedma@quicinc.com, florian_hahn@apple.com, hfinkel@anl.gov, hideki.saito@intel.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, vivekvpandya@gmail.com
Fixed by commit(s)
Attachments scev.patch (7283 bytes, text/plain)
Blocks
Blocked by
See also PR47247, PR52464
I see bugs 38280, 37423, but this one is simpler still. Consider the following:

    void topup(int a[], unsigned long i)
    {
        for (; i < 16; i++) {
            a[i] = 1;
        }
    }

The result, compiled with -O -march=haswell

- loop is vectorized despite the small absolute limit on trip count
- vectorized part has loop step of 256 (!) and is actually unreachable
Quuxplusone commented 5 years ago

What's happening depends on the target: for x86_64-apple-macos we generate a check if I < 16 and a call to memset_pattern16, for x86_64-unknown-linux we generate a big vectorized version as reported.

I guess the problem is the unknown start value of the induction variable.

Quuxplusone commented 5 years ago

This is coming from LoopUtil's getLoopEstimatedTripCount returning "None".

So, some part of the compiler (such as apple-macos calling memset_pattern16) knows trip count of up to 16 is special and other parts (like LV) does not.

I think the notion of estimated trip count information needs to be extended beyond single value.

What can be described by Intel's loopcount pragma may be a good starting point for the discussion: It can represent min/max/ave as well as individual known constant values.

Here, we have 16 or less (no wrap case) ---- or possibly big number (wrap case). I could see why we'd want to distinguish this from i=0;i<N;i++ where N can be an arbitrary big number.

Quuxplusone commented 5 years ago

With simple loops, but where the exact trip count is unknown, it should still be possible to perform an analysis on the initial loop condition: value range of either side of the (a < b) relation can restrict the full range of the type, yielding upper, and lower, limits to the trip count.

I don't know how easy it is to get at...

Upper limit may be useful as a heuristic: loop step of approx. sqrt(limit) to minimize the overall branch count.

Quuxplusone commented 5 years ago

What I'm saying is that this is most likely the issue that can affect other loop optimizations as well. So, any solution we come up with should not be just vectorizer only. We better have some discussions with many loop optimization stakeholders ---- in llvm-dev. Would you like me to start a thread?

Quuxplusone commented 5 years ago

I don't think there's any fundamnental issue to bring up on llvmdev. The vectorizer tries to handle this sort of construct by calling ScalarEvolution::getSmallConstantMaxTripCount(); it just isn't succeeding here for some reason.

Quuxplusone commented 5 years ago

Sorry, you are right. I was thinking "i != 16" case.

Quuxplusone commented 5 years ago
I am trying to find out why ScalarEvolution is not able to give correct back
edge taken count for an expression.

So in this case flow reaches to howFarToZero() and in that function, I have
following expressions as SCEV

Start = (15 + (-1 * %i) (which is set to Distance SCEV)
Step = 1

now, first of all, should I expect Start as ConstantSCEV (15) instead of the
whole expression
the problem here is getUnsignedRangeMax(Distance) returns very large number
because of -1 in the SCEV.

How we can make this work? Here we can clearly say that after 15 steps this
expression will be 0
and thus we have a value for backedge taken count.
Quuxplusone commented 5 years ago

Attached scev.patch (7283 bytes, text/plain): experimental fix

Quuxplusone commented 5 years ago
That patch is taking the wrong approach.

The IR should look something like this:

define void @topup(i32* nocapture %a, i64 %i) {
entry:
  %cmp3 = icmp ult i64 %i, 16
  br i1 %cmp3, label %for.body, label %for.end

for.body:
  %i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
  store i32 1, i32* %arrayidx, align 4, !tbaa !2
  %inc = add nuw nsw i64 %i.addr.04, 1
  %exitcond = icmp eq i64 %inc, 16
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  ret void
}

If you look at the IR, you have a loop which is roughly of the form "for (; i
!= 16; ++i)".  In general, for loops of that form, the backedge-taken count is
unrestricted: unsigned arithmetic wraps.  (For example, if i is -10, the loop
runs 26 times.)

There are two factors you could use to prove the max backedge-taken count.
One, the induction variable is marked "nuw".  Two, the loop is guarded by an
"icmp ult" which restricts the initial value of the induction variable.  No
patch can work correctly without using one of those bits of information.
Quuxplusone commented 5 years ago

I experimented with folding the information from the preheader into the expression to compute the max backedge taken count today. I’ll try to share a patch tomorrow.

Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #9)
> That patch is taking the wrong approach.
>
> The IR should look something like this:
>
> define void @topup(i32* nocapture %a, i64 %i) {
> entry:
>   %cmp3 = icmp ult i64 %i, 16
>   br i1 %cmp3, label %for.body, label %for.end
>
> for.body:
>   %i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
>   %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
>   store i32 1, i32* %arrayidx, align 4, !tbaa !2
>   %inc = add nuw nsw i64 %i.addr.04, 1
>   %exitcond = icmp eq i64 %inc, 16
>   br i1 %exitcond, label %for.end, label %for.body
>
> for.end:
>   ret void
> }
>
>
> If you look at the IR, you have a loop which is roughly of the form "for (;
> i != 16; ++i)".  In general, for loops of that form, the backedge-taken
> count is unrestricted: unsigned arithmetic wraps.  (For example, if i is
> -10, the loop runs 26 times.)
>
> There are two factors you could use to prove the max backedge-taken count.
> One, the induction variable is marked "nuw".  Two, the loop is guarded by an
> "icmp ult" which restricts the initial value of the induction variable.  No
> patch can work correctly without using one of those bits of information.

Here i is declared as unsigned so max back-edge taken count should be 16
because for any value greater than 16 loops should not get executed and there
will not be unsigned wrapping of i. So I think without "nuw" based on the loop
guard we should be able to figure out max back edge taken count.
Quuxplusone commented 5 years ago

I think we could achieve the desired result by folding information from the loop guards into the expression we use to compute the max backedge taken count. I've put up a initial patch, which just deals with getting a more precise max count https://reviews.llvm.org/D67178

Quuxplusone commented 5 years ago

So I think without "nuw" based on the loop guard we should be able to figure < > out max back edge taken count.

I thought that's what I said; you need either the nuw or the loop guard (but not both).

Quuxplusone commented 4 years ago

Just a quick update. The loop vectorizer now only uses a vectorization factor of 8 for the example, but the unroller unrolls with vector body 8 times later on

https://godbolt.org/z/MdWWbd