Open Quuxplusone opened 5 years ago
Attached opt-me.ll
(12368 bytes, text/plain): test showing the problem
From the vectorizer's perspective, PRE went a bit too far, creating a backward
cross-iteration dependence that prevents loop vectorization. It should know
that the transformation creating the %17 PHI would cause a problem for the
vectorizer. If we are to vectorize the result of PRE, we'll have to undo that
part of PRE, by recognizing that the memref-ed value in the PHI can be
re-evaluated using the induction variable.
Thanks,
Hideki
=====================
Good Case: (redundant memref, but indexed by induction variable)
loop_chpl_3blk_body_:
...
%i_chpl.0 = phi i64 [ %63, %loop_chpl_3blk_body_ ], [ %i_chpl.0.ph, %loop_chpl_3blk_body_.preheader8 ]
...
%63 = add nsw i64 %i_chpl.0, 1
%64 = getelementptr inbounds i32, i32* %16, i64 %63
%65 = load i32, i32* %64, align 4, !tbaa !39, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
store i32 %65, i32* %57, align 4, !tbaa !39, !alias.scope !35, !noalias !36, !llvm.mem.parallel_loop_access !19
Bad Case: (no redundant memref, but loaded data is merged with Phi)
%16 = load i32*, i32** %15, align 8, !tbaa !20, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
%.phi.trans.insert = getelementptr inbounds i32, i32* %16, i64 %.fca.0.load
%.pre = load i32, i32* %.phi.trans.insert, align 4, !tbaa !39, !alias.scope !37, !noalias !38
loop_chpl_3blk_body_:
%17 = phi i32 [ %31, %loop_chpl_3blk_body_ ], [ %.pre, %loop_chpl_3blk_body_.preheader ]
...
store i32 %17, i32* %28, align 4, !tbaa !39, !alias.scope !35, !noalias !36, !llvm.mem.parallel_loop_access !19
%29 = add nsw i64 %i_chpl.0, 1
%30 = getelementptr inbounds i32, i32* %16, i64 %29
%31 = load i32, i32* %30, align 4, !tbaa !39, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
The loop vectorizer does try to cope with such "first-order recurrences",
which are often introduced by GVN, but it needs to reschedule instructions in
order to succeed in this case. Specifically, the initial sequence of
loop: (i=0; i++)
t1 = load from a[i]
store t1
t2 = load from a[i+1]
store t2
is optimized by GVN into
init = load from a[0]
loop: (i=0; i++)
t1 = phi(init, t2)
store t1
t2 = load from a[i+1]
store t2
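In C terms, the two forms above compute the same thing; here is a minimal scalar sketch (the destination array `out` and its indexing are illustrative, since the pseudocode above leaves the store targets unspecified):

```c
#include <assert.h>

/* Original loop: each iteration loads a[i], which PRE proves redundant
   because the previous iteration already loaded the same element as its
   a[i+1]. */
static void before_pre(int *out, const int *a, int n) {
    for (int i = 0; i < n; i++) {
        out[2 * i] = a[i];          /* t1 = load from a[i]; store t1 */
        out[2 * i + 1] = a[i + 1];  /* t2 = load from a[i+1]; store t2 */
    }
}

/* After PRE/GVN: the redundant load becomes a value carried across
   iterations in a PHI (modeled here by the variable t1). */
static void after_pre(int *out, const int *a, int n) {
    int t1 = a[0];                  /* init = load from a[0], in the preheader */
    for (int i = 0; i < n; i++) {
        out[2 * i] = t1;            /* store t1 (the PHI value) */
        int t2 = a[i + 1];          /* t2 = load from a[i+1] */
        out[2 * i + 1] = t2;        /* store t2 */
        t1 = t2;                    /* loop-carried: next iteration's t1 */
    }
}
```

The `t1 = t2` assignment at the bottom is exactly the backward cross-iteration dependence the PHI introduces.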
In order to vectorize this successfully, the vectorized value for t1 needs to
be composed from the last element loaded in the previous (vectorized)
iteration combined with all but the last element loaded in the current one; so
this shuffle needs to be placed after the load from a[i+1], but before the
store of t1, which wants to use it:
vinit = [undef:undef:undef:a[0]]
vloop: (i=0; i+=4)
prev_vt2 = phi(vinit, vt2)
<need to compute vt1 here in time to feed the store, but can't>
store vt1
vt2 = load from a[i+1:i+2:i+3:i+4]
<can compute vt1 only here: vt1 = [prev_vt2[3]:vt2[0]:vt2[1]:vt2[2]]>
store vt2
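A scalar C simulation of this 4-wide vector loop may make the shuffle placement concrete (lane arrays stand in for vector registers; the `out` array and its indexing are illustrative, matching the sketch above):

```c
#include <assert.h>

/* Simulates the 4-wide vector loop: vt1 is composed from the last lane of
   the previous iteration's vt2 and the first three lanes of this one's,
   so it can only be formed after the vt2 load has executed.
   n4 must be a multiple of 4. */
static void vloop(int *out, const int *a, int n4) {
    int prev_vt2[4] = {0, 0, 0, 0};
    prev_vt2[3] = a[0];                  /* vinit: lanes 0-2 are "undef" */
    for (int i = 0; i < n4; i += 4) {
        int vt2[4], vt1[4];
        for (int l = 0; l < 4; l++)
            vt2[l] = a[i + 1 + l];       /* vt2 = load a[i+1 : i+4] */
        /* The shuffle: legal only here, after the vt2 load. */
        vt1[0] = prev_vt2[3];
        vt1[1] = vt2[0];
        vt1[2] = vt2[1];
        vt1[3] = vt2[2];
        for (int l = 0; l < 4; l++) {    /* store vt1, then store vt2 */
            out[2 * (i + l)] = vt1[l];
            out[2 * (i + l) + 1] = vt2[l];
        }
        for (int l = 0; l < 4; l++)
            prev_vt2[l] = vt2[l];        /* carry vt2 into the PHI */
    }
}
```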
This could be achieved by sinking the store of vt1 past the load of vt2,
similar to how casts are sunk in https://reviews.llvm.org/D33058, with
additional checks for memory dependencies:
Index: lib/Analysis/IVDescriptors.cpp
===================================================================
--- lib/Analysis/IVDescriptors.cpp (revision 345958)
+++ lib/Analysis/IVDescriptors.cpp (working copy)
@@ -644,6 +644,13 @@
SinkAfter[I] = Previous;
return true;
}
+ if (isa<StoreInst>(I) && (I->getParent() == Phi->getParent())) {
+ if (!DT->dominates(Previous, I)) { // Otherwise we're good w/o sinking.
+ // Check here if I can indeed be sunk to after Previous.
+ SinkAfter[I] = Previous;
+ return true;
+ }
+ }
}
I agree with Ayal on trying to detect this pattern in the vectoriser.
This can also come from over-optimised hand-written code, so making the vectoriser understand these patterns is desirable.
But the problem with the undef approach is that it may misalign loads which, even if non-trapping, will still be slower than aligned loads, especially when repeated inside the loop.
A perhaps more invasive, but more profitable, transformation is, upon recognising that there are no read/write dependencies and no loop-carried dependencies, to just make the init load [a0, a1, ..., an-1], if you can guarantee that the loop will execute at least n-1 times.
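A scalar sketch of this alternative, under the stated assumptions (the `out` array is illustrative, as in the earlier sketches; note that it reads one aligned vector past the last element the scalar loop touches, so the array needs padding or a scalar epilogue):

```c
#include <assert.h>

/* Aligned-init alternative: the recurrence's init is a full aligned load
   of a[0..3], every in-loop load is also aligned (a[i+4 : i+7]), and the
   misaligned value a[i+1 : i+4] is formed by a shuffle instead of a load.
   n4 must be a multiple of 4; a must be valid up to a[n4+3]. */
static void vloop_aligned(int *out, const int *a, int n4) {
    int vt1[4];
    for (int l = 0; l < 4; l++)
        vt1[l] = a[l];                   /* vinit = aligned load a[0..3] */
    for (int i = 0; i < n4; i += 4) {
        int vnext[4], vt2[4];
        for (int l = 0; l < 4; l++)
            vnext[l] = a[i + 4 + l];     /* aligned load a[i+4 : i+7] */
        vt2[0] = vt1[1];                 /* vt2 = shuffle(vt1, vnext)  */
        vt2[1] = vt1[2];                 /*     = a[i+1 : i+4]         */
        vt2[2] = vt1[3];
        vt2[3] = vnext[0];
        for (int l = 0; l < 4; l++) {
            out[2 * (i + l)] = vt1[l];
            out[2 * (i + l) + 1] = vt2[l];
        }
        for (int l = 0; l < 4; l++)
            vt1[l] = vnext[l];           /* carry into the next iteration */
    }
}
```

Here both loads per iteration are aligned, at the cost of the over-read and the trip-count guarantee.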
The "vinit = [undef:undef:undef:a[0]]" was shorthand for:
...
%.pre = load i32, i32* %.phi.trans.insert, align 4 ...
...
vector.ph:
%n.mod.vf = urem i64 %19, 4
%n.vec = sub i64 %19, %n.mod.vf
%ind.end = add i64 %.fca.0.load, %n.vec
%vector.recur.init = insertelement <4 x i32> undef, i32 %.pre, i32 3
br label %vector.body
There may be other reasons why it is worth preventing/undoing such
cross-iteration dependencies before vectorization, and applying the PRE
optimization afterwards on vectorized loops.
I've put up a patch to slightly generalize the checks in first order recurrence detection to fix a related issue (https://bugs.llvm.org/show_bug.cgi?id=43398). That should be easily extendable to support this case as well.
If you know how to easily extend it - a patch doing so or something showing how would be really appreciated! Thanks!