Open Quuxplusone opened 5 years ago
What's happening depends on the target: for x86_64-apple-macos we generate a check if I < 16 and a call to memset_pattern16, for x86_64-unknown-linux we generate a big vectorized version as reported.
I guess the problem is the unknown start value of the induction variable.
This is coming from LoopUtil's getLoopEstimatedTripCount returning "None".
So, some part of the compiler (such as apple-macos calling memset_pattern16) knows trip count of up to 16 is special and other parts (like LV) does not.
I think the notion of estimated trip count information needs to be extended beyond single value.
What can be described by Intel's loopcount pragma may be a good starting point for the discussion: It can represent min/max/ave as well as individual known constant values.
Here, we have 16 or less (no wrap case) ---- or possibly big number (wrap case). I could see why we'd want to distinguish this from i=0;i<N;i++ where N can be an arbitrary big number.
With simple loops, but where the exact trip count is unknown, it should still be possible to perform an analysis on the initial loop condition: value range of either side of the (a < b) relation can restrict the full range of the type, yielding upper, and lower, limits to the trip count.
I don't know how easy it is to get at...
Upper limit may be useful as a heuristic: loop step of approx. sqrt(limit) to minimize the overall branch count.
What I'm saying is that this is most likely the issue that can affect other loop optimizations as well. So, any solution we come up with should not be just vectorizer only. We better have some discussions with many loop optimization stakeholders ---- in llvm-dev. Would you like me to start a thread?
I don't think there's any fundamnental issue to bring up on llvmdev. The vectorizer tries to handle this sort of construct by calling ScalarEvolution::getSmallConstantMaxTripCount(); it just isn't succeeding here for some reason.
Sorry, you are right. I was thinking "i != 16" case.
I am trying to find out why ScalarEvolution is not able to give correct back
edge taken count for an expression.
So in this case flow reaches to howFarToZero() and in that function, I have
following expressions as SCEV
Start = (15 + (-1 * %i) (which is set to Distance SCEV)
Step = 1
now, first of all, should I expect Start as ConstantSCEV (15) instead of the
whole expression
the problem here is getUnsignedRangeMax(Distance) returns very large number
because of -1 in the SCEV.
How we can make this work? Here we can clearly say that after 15 steps this
expression will be 0
and thus we have a value for backedge taken count.
Attached scev.patch
(7283 bytes, text/plain): experimental fix
That patch is taking the wrong approach.
The IR should look something like this:
define void @topup(i32* nocapture %a, i64 %i) {
entry:
%cmp3 = icmp ult i64 %i, 16
br i1 %cmp3, label %for.body, label %for.end
for.body:
%i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
store i32 1, i32* %arrayidx, align 4, !tbaa !2
%inc = add nuw nsw i64 %i.addr.04, 1
%exitcond = icmp eq i64 %inc, 16
br i1 %exitcond, label %for.end, label %for.body
for.end:
ret void
}
If you look at the IR, you have a loop which is roughly of the form "for (; i
!= 16; ++i)". In general, for loops of that form, the backedge-taken count is
unrestricted: unsigned arithmetic wraps. (For example, if i is -10, the loop
runs 26 times.)
There are two factors you could use to prove the max backedge-taken count.
One, the induction variable is marked "nuw". Two, the loop is guarded by an
"icmp ult" which restricts the initial value of the induction variable. No
patch can work correctly without using one of those bits of information.
I experimented with folding the information from the preheader into the expression to compute the max backedge taken count today. I’ll try to share a patch tomorrow.
(In reply to Eli Friedman from comment #9)
> That patch is taking the wrong approach.
>
> The IR should look something like this:
>
> define void @topup(i32* nocapture %a, i64 %i) {
> entry:
> %cmp3 = icmp ult i64 %i, 16
> br i1 %cmp3, label %for.body, label %for.end
>
> for.body:
> %i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
> %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
> store i32 1, i32* %arrayidx, align 4, !tbaa !2
> %inc = add nuw nsw i64 %i.addr.04, 1
> %exitcond = icmp eq i64 %inc, 16
> br i1 %exitcond, label %for.end, label %for.body
>
> for.end:
> ret void
> }
>
>
> If you look at the IR, you have a loop which is roughly of the form "for (;
> i != 16; ++i)". In general, for loops of that form, the backedge-taken
> count is unrestricted: unsigned arithmetic wraps. (For example, if i is
> -10, the loop runs 26 times.)
>
> There are two factors you could use to prove the max backedge-taken count.
> One, the induction variable is marked "nuw". Two, the loop is guarded by an
> "icmp ult" which restricts the initial value of the induction variable. No
> patch can work correctly without using one of those bits of information.
Here i is declared as unsigned so max back-edge taken count should be 16
because for any value greater than 16 loops should not get executed and there
will not be unsigned wrapping of i. So I think without "nuw" based on the loop
guard we should be able to figure out max back edge taken count.
I think we could achieve the desired result by folding information from the loop guards into the expression we use to compute the max backedge taken count. I've put up a initial patch, which just deals with getting a more precise max count https://reviews.llvm.org/D67178
So I think without "nuw" based on the loop guard we should be able to figure < > out max back edge taken count.
I thought that's what I said; you need either the nuw or the loop guard (but not both).
Just a quick update. The loop vectorizer now only uses a vectorization factor of 8 for the example, but the unroller unrolls with vector body 8 times later on
scev.patch
(7283 bytes, text/plain)