[LoopVectorizer/SCEV] induction with truncation prevents vectorization. Need runtime overflow test.

Quuxplusone commented 8 years ago


Bugzilla Link	PR30654
Status	RESOLVED FIXED
Importance	P normal
Reported by	Dorit Nuzman (dorit.nuzman@intel.com)
Reported on	2016-10-10 07:58:58 -0700
Last modified on	2017-12-14 06:18:23 -0800
Version	trunk
Hardware	PC All
CC	andrei.elovikov@intel.com, anemet@apple.com, ehsanamiri@gmail.com, elena.demikhovsky@intel.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, mkuper@google.com, sanjoy@playingwithpointers.com, silviu.baranga@arm.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Saw this missed optimization in a Geekbench workload:
We have a signed int index ‘w_ix’ which is incremented by an unsigned long
‘step’.
When compiling with -m32 all is well.
However when compiling with -m64 the result of the ulong addition may not fit
in back into the sint index and so we may have sint overflow (the index may
wrap).

for.body:
  %w_ix.014 = phi i64 [ %add3, %for.body ], [ 0, %for.body.preheader ]
  %sext = shl i64 %w_ix.014, 32
  %idxprom = ashr exact i64 %sext, 32
  %add3 = add i64 %idxprom, %step

As a result the loop vectorizer fails with
“LV: PHI is not a poly recurrence… Found an unidentified PHI”.
"loop not vectorized: value that could not be identified as reduction is used
outside the loop."

In order to guarantee that the induction behaves nicely we need to identify
this pattern (addition with 64-to-32-bit truncation), and generate a runtime
sint overflow check (e.g. check that step*loopTripCount is small enough).

This is a reduced testcase:

#include <stdlib.h>
float in[1000];
float out[1000];
void test(size_t out_start, size_t size, size_t step)
{
    int w_ix = 0;
    for (size_t out_offset = 0; out_offset < size; ++out_offset)
    {
        size_t out_ix = out_start + out_offset;
        float w = in[w_ix];
        out[out_ix] += w;
        w_ix += step;
    }
}

(I compiled it with -m64 -Ofast -static -march=core-avx2 ).

Quuxplusone commented 7 years ago

I think your analysis is correct (and I've also noticed this a while ago). This should be very easy to fix, we just need to handle trunc in SCEVPredicateRewriter.

Quuxplusone commented 7 years ago

(In reply to comment #1)
> I think your analysis is correct (and I've also noticed this a while ago).
> This should be very easy to fix, we just need to handle trunc in
> SCEVPredicateRewriter.

Thanks a lot for taking a look, Silviu.

Could you please elaborate a bit on the proposed solution?

(actually, don't bother if you're already working on a fix... ;-) otherwise,
please do explain!)

I assume you don't mean we'll add an Overflow check each time the visitor
encounters a trunc(x), but only when the operand 'x' is an AddRecExpr, like in
the sext/zext cases? The problem is that the trunc operand we have at hand is
not an AddRec yet… (we are trying to rewrite it into one…) ?

Also, with the current flow of things I don't see how SCEVPredicateRewriter
will get to visit the trunc even if we added a case for trunc…: When the
SCEVPredicateRewriter is triggered from the call to "PSE.getAsAddRec(Phi)" in
isInductionPhi, the SCEV of our Phi at that point is "%w_ix" (***) ;
SCEVPredicateRewriter will do nothing with "%w_ix"… To make the rewriter see
the trunc we need it to operate on the SCEV of the backedge value (%add3):
(sext i32 (trunc i64 %w_ix to i32) to i64) + %step.

(***)"%w_ix" is the SCEV that createAddRecFromPHI currently returns: it finds
the BackEdge value (%add3), and obtain it's SCEV:
(sext i32 (trunc i64 %w_ix to i32) to i64) + %step;
now it expect that one  of the operands of the SCEVAddExpr will be %w_ix (the
symbolic phi), and since it isn't, it bails out and return UnknownSCEV (%w_ix).
I don't know if it makes sense to change createAddRecFromPHI to return the more
elaborate SCEV instead of %w_ix and feed that to the existing rewriter (and
extend the rewriter to handle this case), or maybe we should write a
specialized "Phi-SCEVPredicateRewriter" (something like a PSCEV version of
createAddRecFromPHI that can add runtime checks)…?

Another thing I'm concerned about is that %idxprom is later also used as a GEP
index, but when we compute the SCEV for the GEP, we obtain the SCEV expressions
for the GEP operands via regular (non-PSCEV) getSCEV calls; in other words, I
don't see where in the current flow of things we would consult with the
RewriteMap to even see the new AddRec SCEV we (hopefully) will have been able
to obtain for %idxprom…? (we have to in order to get an AddRec SCEV for the
GEP...)

Probably I just totally got lost in the SCEV machinery… if things are much
simpler than that, that would be really great… :-)

Quuxplusone commented 7 years ago

Thanks for the detailed explanation! I didn't realize that we weren't able to analyze w_ix. This really complicates things.

The preferred solution should be to teach SCEV to be able to analyze this node. However, I don't see what the expression would be.

Anyway, it looks like this needs more thought.

Quuxplusone commented 7 years ago

I tried a solution (hardcoded hack really) where I made two additions to the
ScevPredictaeRewriter:
- Added a case for visitTruncateExpr
- Extended the case of visitUnknown to make more effort if we are dealing with
a phi; so if the Expr is a Phi we start like createAddRecFromPHI to analyze the
def-use cycle, but we don't give up when we encounter the pattern
sext(trunc(X)) (by adding the necessary overflow check).

This lets us overcome the induction identification problem (!), but this alone
is not sufficient for the overall testcase to get vectorized:

(Just a reminder: We have this IR:
for.body:
  %w_ix.014 = phi i64 [ %add3, %for.body ], [ 0, %for.body.preheader ]
  %sext = shl i64 %w_ix.014, 32
  %idxprom = ashr exact i64 %sext, 32
  %arrayidx = getelementptr @in, 0, %idxprom
…
  %add3 = add i64 %idxprom, %step
…)

So there are several hiccups down the road after passing the phi analysis stage:

1) The memory checks that follow the induction checks are not able to take
advantage of the Predicates we added during the induction analysis stage;
Looks like LoopAccessInfo has its own PSE, and does not have the Predicates
that had already been added by the vectorizer... So when we get to analyze the
GEP whose index is the induction variable in question, we are not able to
obtain an AddRec.
(The call flow is canCreateRTCheck --> hasComputableBounds --> PSE::getSCEV -->
SE.rewriteUsingPredicate, which calls the rewriter with null NewPreds; the
overflow check that we need in order to rewrite %idxprom as an AddRec is not
implied by the existing Predicates because there are no Predicates in LAI's
PSE...);

Looks like we either want to "give" (copy over) predicates to the LAI->PSE (but
that would mean potentially also giving LAI predicates that it doesn't need; we
could try to avoid this I guess); Or, allow the pointer analysis in LAA to be a
bit more aggressive, namely: let it add predicates during the analysis of the
GEP indices (in this specific scenario, the additional predicates that LAA will
add are the same predicates that we added during the Phi analysis, so in the
end we are not creating more runtime checks).

2) The next hiccup is when we try to determine the cost of the GEP; the cost
model calls getAddressAccessSCEV, which uses regular SE->getSCEV() for the
indices, instead of PSE.getSCEV()... So it is not able to obtain an AddRec and
concludes that the address computation is costly and the cost of the scalarized
GEP prevents vectorization.

3) With that, namely:
- extending ScevPredicateRewriter to analyze phi's,
- letting hasComputableBounds call getAsAddRec,
- and using PSE in getAddressAccessSCEV instead of SE,
This loop gets vectorized all the way (!!),But...we generate pretty terrible
code because we actually vectorize all the cast operations in the IR -- the
casts that should be redundant under the SCEV runtime overflow checks. This is
because we don't mark anywhere that these are part of the induction cycle and
should be ignored. So looks like this calls for extending our Induction
Descriptor to hold a set of cast instructions.

So, overall, I think to properly support inductions with type casts such as in
this example, that can be redundant under a runtime overflow check, we need to:

1) Extend the Induction Descriptor to keep the Set of cast instructions that
participate in the induction def-use cycle; and also to record the actual data
type we want to use when we transform the induction (in this example, under the
runtime test we don't want to use the original type of the phi (64bit) but
rather 32bit).

2) Extend the ScevPredicateRewriter to analyze phi's; When it will be called
from isInductionPHI, it will get an empty Set that it could fill up with cast
instructions that participate in the induction. During the visit, we may add
predicates and map them to all the relevant SCEV's that participate in the
induction cycle.

3) Have hasComputableBounds call getAsAddRec instead of getSCEV (or copy over
predicates to LAI??)

4) Use PSE to obtain SCEV's in getAddressAccessSCEV (instead of SE)

5) Extend collectTriviallyDead and collectValuesToIgnore to consider the Set of
casts in the Induction Descriptor (so that the cost model and the widening
stages will skip them)

6) Change Induction.trasnform() to take the data-type from the
InductionDescriptor rather than the phi

7) During widening of the phi node (and also during scalarization of the phi
node) we want to have the vectorized/scalarized Value mapped both to the
original scalar phi, but also to the set of cast instructions (in case they are
used by anyone outside the induction cycle). So they will never be directly
vectorized/scalarized (will be skipped) but they will still have
avectorized/scalarized Value (same one as the phi / the induction update
instruction) if needed.

Hopefully that covers most of what's needed… (I did not implement this, just
hacked my way from one hurdle to the next, so I may be missing things…).

Do you see any issues with any of the above?

Thanks!
Dorit

Quuxplusone commented 7 years ago

The plan for the SCEVPredicateRewriter change makes sense, but there needs to be some caching of the analysis for the SCEVUnknown case. It looks like this should also work for determining the predicated backedge taken count?

1) LAA should be an input to vectorization, so if LAA needs to add some predicates it should do so. The predicates from LAA get copied over to the loop vectorizer, so if LAA decides that some expression is actually an AddRecExpr LV will see it too.

So I guess having LAA be more aggressive and adding predicates should be the way to go. We can add predicates as long as they are needed to avoid an unknown dependence.

2) Ok, that shouldn't be too hard to fix.

3) I had a patch for hasComputableBounds and some other stuff in LAA some time ago but never got committed (https://reviews.llvm.org/D17080). I don't have any plans to work on it in the immediate future, but it might help you (it probably doesn't apply anymore).

For optimizing the casts: is this something similar to what IndVarSimplify does (except we have better SCEV expressions from PSE)?

Anyway, just getting the vectorizer to consider that loop would be a good step forward.

Quuxplusone commented 7 years ago

(In reply to comment #5)

Hi Silviu,

Thanks for the feedback!

> The plan for the SCEVPredicateRewriter change makes sense, but there needs to
> be some caching of the analysis for the SCEVUnknown case.

Yes; For caching across different calls to the rewriter, I was hoping to rely
on the PSE.Preds and its SCEVToPreds map. I'd appreciate your thoughts on this:
Say we have:
%ptr1 with SCEV = "(8*sext(trunc(%xphi))) + @In" ;
and
%ptr2 with SCEV = "(4 + (8*sext(trunc(%xphi)))) + @In" ;

At some point we call getAsAddRec(%ptr1) for the first time;
the rewriter will visitAdd->visitMul->visitSext->visitTrunc->visitUnknown; Now,
the result of the phi analysis in visitUnknownSCEV(%xphi) will be:

(a) Creation of a new AddRec: AR0 = (0, +, sext(trunc(%step))) (*)
(b) Addition of a new Predicate: P0  = { NSSW_flag,  AR0 }     (**)
(c) The predicate P0 will also be mapped to each of the following SCEVs in the
SCEVToPreds map: "%xphi", "trunc(%xphi)", "sext(trunc(%xphi))". And we'll
extend the visitSext, visitTrunc, and visitUnknown cases to first check Pred:
        if (Pred) {
          auto ExprPreds = Pred->getPredicatesForExpr(Expr);
          for (auto *Pred : ExprPreds)
            if (const auto *IPred = dyn_cast<SCEVWrapPredicate>(Pred))
              if (IPred->getLHS() == Expr)
                return IPred->getRHS();
        }

This way, when we later call PSE.getSCEV(%ptr2), the rewriter will immediately
return "(0, +, sext(trunc(%step)))" in visitSext, without adding new predicates
or creating new AddRecs with casts.

The problem is that WrapPredicate is currently meant only for AddRecs... So
maybe we can extend the WrapPredicate somehow to allow us to cache the other
SCEVs there… The idea is that a wrapPredicate will continue to always be
associated with some AddRec AR, but optionally also with non-addRec SCEVs that
can/should be rewritten into that AR… do you think that could work?

More sanity checks while we're here:
(*) In the specific example in this PR, it should be ok for the phi analysis in
SCEVUnknown to return just (0,+,%step), right?
(**) Is NSSW correct/sufficient here? (that is, upon encountering the pattern:
phi->trunc->sext->add->phi during the phi analysis)?

More on caching analysis results:
The above was w.r.t caching across different calls to the rewriter. I think we
also need some extra "internal" caching, that lives only during a single
invocation of the rewriter: So that when we go back the visit* call chain
(visitUnknown->visitTrunc->visitSext->etc.) we won't create/add redundant
AddRecs/Predicates. Otherwise, when visitUnknown will return (0,+,%step) with
predicate p0, visitTrunc will then return (0,+,trunc(%step)) and create
predicate p1, and visitSext will then return (0,+,sext(trunc(%step))) and
create predicate p2... So for that we'll probably need a temporary "RewriteMap"
that the phi-analysis in visitUnknown will fill in, and that
visitSext/visitTrunc will use.

I hope this all/largely makes sense…?

> It looks like this
> should also work for determining the predicated backedge taken count?

will look into that

> 1) LAA should be an input to vectorization, so if LAA needs to add some
> predicates it should do so. The predicates from LAA get copied over to the
> loop vectorizer, so if LAA decides that some expression is actually an
> AddRecExpr LV will see it too.
>
> So I guess having LAA be more aggressive and adding predicates should be the
> way to go. We can add predicates as long as they are needed to avoid an
> unknown dependence.

Great!

> 2) Ok, that shouldn't be too hard to fix.

Right (I just had a small concern that maybe in some places we intentionally
not use PSE for some reason).

> 3) I had a patch for hasComputableBounds and some other stuff in LAA some time
> ago but never got committed (https://reviews.llvm.org/D17080). I don't have
> any plans to work on it in the immediate future, but it might help you (it
> probably doesn't apply anymore).

Thanks a lot for the pointer!! This is exactly what I need... any reason why
you abandoned this work?
I guess it makes sense to push that work as is (I mean, standalone, separately
from the rest of my SCEV/induction changes).

> For optimizing the casts: is this something similar to what IndVarSimplify
> does (except we have better SCEV expressions from PSE)?

I wasn't aware of this pass… but I don't think it is similar to what we want
(please correct me if I'm wrong):

We just want to ignore the casts in the cost model and treat them as dead in
the actual vectorization stage. We want to do that even if the casted values
are used by operations other than the induction itself. Now, if a casted value
(%castedV) is used in an operation that is being widened/scalarized, we want to
find in VectorLoopValueMap[%castedV] the same widened/scalarized value that we
have in VectorLoopValueMap[%phi]. This means that when we widen/scalarize the
phi, we will also initialize the respective entries in VectorLoopValueMap for
all the cast instructions we have gathered in our Induction descriptor.

That's the main thing I had in mind in terms of cast optimizations; I'm not
going to make any IR changes/simplifications in the code before I start
widening.

(BTW, all this works only for the specific pattern at hand, namely: phi->trunc-
>sext->add->phi; If some of the casts use the value of the induction update
instruction (like so: phi->sext->add->trunc->phi) the above will have to be
adjusted...)

Another related thought was whether it makes sense, in the case of the example
at hand, to have the phi analysis in visitUnknownSCEV return a 32bit AddRec:
(trunc(0),+,trunc(%step))... I think it would be correct under the overflow
check that we are creating (right?), but it requires some additional changes in
the vectorizer because now the AddRec types don't match the phi type, and
several places in the vectorizer assume that they do. I am not sure if it makes
sense to do this cast optimization in the SCEVrewriter, or if this is something
that should be optimized separately after the vectorizer (maybe by
IndVarSimplify??).

> Anyway, just getting the vectorizer to consider that loop would be a good step
forward.

Yes. I guess we could try to push this in stages / separate patches:
(Most of these can be pushed in parallel independent of one another):
(1) D17080
(2) use PSE in getAddressAccessSCEV
(3) Add a visitTruncExpr in the SCEVPredicateRewriter
(4) extend the SCEVPredicateRewriter to analyze Phis with casts between the phi
and the induction update instruction
(5) Extend the Induction descriptor + related vectorizer changes, to avoid
vectorizing casts
(6) potentially generalize to additional induction patterns

Thanks a lot!
Dorit

Quuxplusone commented 7 years ago

(In reply to comment #6)
> (In reply to comment #5)
>
> Hi Silviu,
>
> Thanks for the feedback!
>
> > The plan for the SCEVPredicateRewriter change makes sense, but there needs
to
> > be some caching of the analysis for the SCEVUnknown case.
>
> Yes; For caching across different calls to the rewriter, I was hoping to
> rely on the PSE.Preds and its SCEVToPreds map. I'd appreciate your thoughts
> on this:
….
> I hope this all/largely makes sense…?
>

Probably this didn't make much sense because I mixed APIs from different
Predicate classes and quite abused the way Preds are used today. Sorry for not
making sense.

In short, my proposal is to reuse Preds/SCEVToPreds to cache the fact that
"%xphi" can be rewritten to "0,+,%step" (under a predicate).
For that, we could extend the WrapPredicate to also hold a non-AddRec LHS
member, that maps to what is currently the "AR" member.
Alternatively, we could extend the EqualPredicate to support a non-constant RHS
(but also an AddRec SCEV).
Or we could create a new kind of Predicate, say - SCEVToAddRecPredicate, which
would hold the SCEV that can be mapped to an AR under a certain predicate.

Please let me know which of the above makes more sense / or if you think this
is a wrong direction and we should do this caching differently...

>
> > 3) I had a patch for hasComputableBounds and some other stuff in LAA some
time
> > ago but never got committed (https://reviews.llvm.org/D17080). I don't have
> > any plans to work on it in the immediate future, but it might help you (it
> > probably doesn't apply anymore).
>
> Thanks a lot for the pointer!! This is exactly what I need... any reason why
> you abandoned this work?
> I guess it makes sense to push that work as is (I mean, standalone,
> separately from the rest of my SCEV/induction changes).

In the meantime I updated the patch (only tiny change was needed) and retested
it. It actually gives performance benefit on a couple benchmarks (!). So I
would strongly support pushing this! Would you prefer if I repost your patch
for review?

> > Anyway, just getting the vectorizer to consider that loop would be a good
step
> forward.
>
> Yes. I guess we could try to push this in stages / separate patches:
> (Most of these can be pushed in parallel independent of one another):
> (1) D17080
> (2) use PSE in getAddressAccessSCEV
> (3) Add a visitTruncExpr in the SCEVPredicateRewriter

I am not sure what is the correct way to do this; we want the no-overflow check
to be done on the truncated type, right? Like so:???

"
     return Expr;
   }

+  const SCEV *visitTruncateExpr(const SCEVTruncateExpr *Expr) {
+    const SCEV *Operand = visit(Expr->getOperand());
+    const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(Operand);
+    if (AR && AR->getLoop() == L && AR->isAffine()) {
+      const SCEV *Step = AR->getStepRecurrence(SE);
+      Type *Ty = Expr->getType();
+      const SCEV *TruncAR =
+        SE.getAddRecExpr(SE.getTruncateExpr(AR->getStart(), Ty),
+                         SE.getTruncateExpr(Step, Ty), L,
+                         AR->getNoWrapFlags());
+      if (addOverflowAssumption(cast<SCEVAddRecExpr>(TruncAR),
+                   SCEVWrapPredicate::IncrementNSSW))
+        return TruncAR;
+    }
+    return SE.getTruncateExpr(Operand, Expr->getType());
+  }
+
   const SCEV *visitZeroExtendExpr(const SCEVZeroExtendExpr *Expr) {
     const SCEV *Operand = visit(Expr->getOperand());
     const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(Operand);
"

Thanks,
Dorit

Quuxplusone commented 7 years ago

Sorry for taking so long to reply.

I would prefer extending the EqualsPredicate to extending the WrapPredicate, but that might mean having predicates that don't need to be checked. It might be easier to just have a separate container in ScalarEvolution (this will probably all be better figured out during code review).

The code you posted for adding a check for Trunc expressions seems correct to me.

If you have something that works, it might be worth creating a review for it just so people can give comments and get the full picture.

Regarding D17080: the problem previously was that it just took too long to get code review and some other things became more urgent for me. I think it should be just a matter of poking the right people. I'm ok with you commandeering the review in phabricator or re-submitting the patch.

Quuxplusone commented 7 years ago

In fact extending EqualPredicate is probably the way to do it since we need to move this information with a set of predicates. It does require having predicates that don't need to be checked.

I think the code for visitTruncExpr should use SCEVWrapPredicate::IncrementNUSW.

Quuxplusone commented 7 years ago

(In reply to Silviu Baranga from comment #8)

> If you have something that works, it might be worth creating a review for it
> just so people can give comments and get the full picture.
>

sure, will post something shortly (just wanted to increase my level of
confidence in the patch/overall direction a bit. thanks a lot for providing
early feedback!)

> Regarding D17080: the problem previously was that it just took too long to
> get code review and some other things became more urgent for me. I think it
> should be just a matter of poking the right people. I'm ok with you
> commandeering the review in phabricator or re-submitting the patch.

I added a word of encouragement on D17080; Maybe add a mkuper and sajoy as
reviewers? (and then if needed ping a few days later?) (I don't think I can add
diffs/reviewers to a review owned by someone else)

(I think best if you continue to pursue this, if you haven't totally lost
interest in it. But if you prefer, I'll open a new review for it)

Quuxplusone commented 7 years ago

> I added a word of encouragement on D17080; Maybe add a mkuper and sajoy as
> reviewers? (and then if needed ping a few days later?) (I don't think I can
> add diffs/reviewers to a review owned by someone else)
>

Turns out I can. just did.

Quuxplusone commented 7 years ago

Thanks, if I can get reviews on D17080 I can progress it.

Quuxplusone commented 6 years ago

Solved with r320672:

r320672 | dorit | 2017-12-14 09:56:31 +0200 (Thu, 14 Dec 2017) | 27 lines

[LV] Support efficient vectorization of an induction with redundant casts

(https://reviews.llvm.org/D38948)

Quuxplusone / LLVMBugzillaTest

[LoopVectorizer/SCEV] induction with truncation prevents vectorization. Need runtime overflow test. #29627