Quuxplusone / LLVMBugzillaTest


[loop vectorize] failure to vectorize after GVN load-pre #38207

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR39235
Status NEW
Importance P enhancement
Reported by Michael Ferguson (mpfergu@gmail.com)
Reported on 2018-10-09 13:42:56 -0700
Last modified on 2020-05-07 03:54:56 -0700
Version trunk
Hardware PC All
CC ayal.zaks@intel.com, florian_hahn@apple.com, hfinkel@anl.gov, hideki.saito@intel.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, rengolin@gmail.com
Fixed by commit(s)
Attachments opt-me.ll (12368 bytes, text/plain)
Blocks
Blocked by
See also PR43398, PR45823
Created attachment 20981
test showing the problem

I'm seeing an issue where adding !noalias metadata seems to prevent loop
vectorization from occurring. The cause seems to be that GVN performs a PRE of a
load, and this prevents the loop vectorizer from functioning.

See the attached .ll file. Supposing it's named opt-me.ll, to reproduce you can
do this:

 opt -O3 -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 opt-me.ll -S

Result:

 remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
 remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
 -> loop not vectorized

Compare with

 opt -O3 -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -enable-load-pre=false opt-me.ll -S

 -> loop is vectorized, I see 4 x i32, extractelements, and no loop not vectorized remarks.

If you remove the !noalias metadata from opt-me.ll, vectorization occurs
whether or not -enable-load-pre=false is passed. This seems unfortunate, as
front-ends would like to provide !noalias metadata where possible to improve
(and not hinder!) optimization.

I believe this problem is present in LLVM 6 as well as master.
Quuxplusone commented 5 years ago

Attached opt-me.ll (12368 bytes, text/plain): test showing the problem

Quuxplusone commented 5 years ago
From the vectorizer's perspective, PRE went a bit too far, creating a backward
cross-iteration dependence that prevents loop vectorization. PRE should
recognize that the transformation creating the %17 PHI would cause a problem
for the vectorizer. If we are to vectorize the result of PRE, we'll have to
undo that part of PRE, by deciphering that the memref-ed value in the PHI can
be re-evaluated using the induction variable.

Thanks,
Hideki
=====================
Good Case: (redundant memref, but indexed by induction variable)

loop_chpl_3blk_body_:
  ...
  %i_chpl.0 = phi i64 [ %63, %loop_chpl_3blk_body_ ], [ %i_chpl.0.ph, %loop_chpl_3blk_body_.preheader8 ]
  ...
  %63 = add nsw i64 %i_chpl.0, 1
  %64 = getelementptr inbounds i32, i32* %16, i64 %63
  %65 = load i32, i32* %64, align 4, !tbaa !39, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
  store i32 %65, i32* %57, align 4, !tbaa !39, !alias.scope !35, !noalias !36, !llvm.mem.parallel_loop_access !19

Bad Case:  (no redundant memref, but loaded data is merged with Phi)

  %16 = load i32*, i32** %15, align 8, !tbaa !20, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
  %.phi.trans.insert = getelementptr inbounds i32, i32* %16, i64 %.fca.0.load
  %.pre = load i32, i32* %.phi.trans.insert, align 4, !tbaa !39, !alias.scope !37, !noalias !38

loop_chpl_3blk_body_:
  %17 = phi i32 [ %31, %loop_chpl_3blk_body_ ], [ %.pre, %loop_chpl_3blk_body_.preheader ]
  ...
  store i32 %17, i32* %28, align 4, !tbaa !39, !alias.scope !35, !noalias !36, !llvm.mem.parallel_loop_access !19
  %29 = add nsw i64 %i_chpl.0, 1
  %30 = getelementptr inbounds i32, i32* %16, i64 %29
  %31 = load i32, i32* %30, align 4, !tbaa !39, !alias.scope !37, !noalias !38, !llvm.mem.parallel_loop_access !19
Quuxplusone commented 5 years ago
The loop vectorizer does try to cope with such "First-Order Recurrences" often
introduced by GVN, but needs to reschedule instructions in order to succeed in
this case. Specifically, the initial sequence of

loop: (i=0; i++)
  t1 = load from a[i]
  store t1
  t2 = load from a[i+1]
  store t2

is optimized by GVN into

  init = load from a[0]
loop: (i=0; i++)
  t1 = phi(init, t2)
  store t1
  t2 = load from a[i+1]
  store t2
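
For intuition, here is a scalar C++ sketch of the two loop shapes above (names and the 5-element buffer are illustrative, not from the attached IR). Both forms produce the same stores, which is what makes GVN's transformation legal in the first place:

```cpp
#include <array>
#include <cassert>

using Buf = std::array<int, 5>; // hypothetical small buffer

// Original loop: two loads per iteration; the load of a[i] is
// redundant with the previous iteration's load of a[i+1].
Buf run_original(const Buf &a) {
  Buf out{};
  for (int i = 0; i + 1 < (int)a.size(); ++i) {
    int t1 = a[i];     // t1 = load from a[i]
    out[i] = t1;       // store t1
    int t2 = a[i + 1]; // t2 = load from a[i+1]
    out[i + 1] = t2;   // store t2
  }
  return out;
}

// After GVN load-PRE: a[i] is replaced by a value carried across
// iterations (modeling the phi), leaving one load in the loop.
Buf run_after_pre(const Buf &a) {
  Buf out{};
  int carried = a[0]; // init = load from a[0], in the preheader
  for (int i = 0; i + 1 < (int)a.size(); ++i) {
    int t1 = carried;  // t1 = phi(init, t2)
    out[i] = t1;
    int t2 = a[i + 1];
    out[i + 1] = t2;
    carried = t2;      // feeds the phi on the back edge
  }
  return out;
}
```

The carried value is exactly the first-order recurrence the vectorizer has to cope with.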

In order to vectorize this successfully, the vectorized value for t1 needs to
be composed from the last element loaded in the previous (vectorized) iteration
combined with all but the last element loaded in this iteration; so this
shuffle needs to be placed after the load from a[i+1], but before the store of
t1 which wants to use it:

  vinit = [undef:undef:undef:a[0]]
vloop: (i=0; i+=4)
  prev_vt2 = phi(vinit, vt2)
  <need to compute vt1 here in time to feed the store, but can't>
  store vt1
  vt2 = load from a[i+1:i+2:i+3:i+4]
  <can compute vt1 only here: vt1 = [prev_vt2[3]:vt2[0]:vt2[1]:vt2[2]]>
  store vt2
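
The splice that computes vt1 can be sketched in C++ as follows (Vec4 and the lane values are illustrative; in IR this would be a single shufflevector of the two vectors):

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>; // models a <4 x i32> vector

// vt1 = [prev_vt2[3] : vt2[0] : vt2[1] : vt2[2]]
// Last lane loaded in the previous vector iteration, followed by all
// but the last lane loaded in the current one.
Vec4 splice_recurrence(const Vec4 &prev_vt2, const Vec4 &vt2) {
  return {prev_vt2[3], vt2[0], vt2[1], vt2[2]};
}
```

Because the splice consumes vt2, it cannot be emitted before the load of vt2, which is exactly the scheduling problem described above.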

This could be achieved by sinking the store of vt1 past the load to vt2,
similar to how casts are sunk in https://reviews.llvm.org/D33058, with
additional checks for memory dependencies:

Index: lib/Analysis/IVDescriptors.cpp
===================================================================
--- lib/Analysis/IVDescriptors.cpp      (revision 345958)
+++ lib/Analysis/IVDescriptors.cpp      (working copy)
@@ -644,6 +644,13 @@
         SinkAfter[I] = Previous;
       return true;
     }
+    if (isa<StoreInst>(I) && (I->getParent() == Phi->getParent())) {
+      if (!DT->dominates(Previous, I)) { // Otherwise we're good w/o sinking.
+        // Check here if I can indeed be sunk to after Previous.
+        SinkAfter[I] = Previous;
+        return true;
+      }
+    }
   }
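
A minimal C++ sketch of why the patch's "check here if I can indeed be sunk" matters: reordering a store past a load preserves behavior only when the two addresses do not alias (function and variable names are hypothetical):

```cpp
#include <cassert>

// Program order as in the scalar loop: store t1, then load t2.
int original_order(int *store_addr, int *load_addr, int t1) {
  *store_addr = t1;  // store t1
  return *load_addr; // load t2, after the store
}

// After sinking the store past the load, as SinkAfter would arrange.
int sunk_order(int *store_addr, int *load_addr, int t1) {
  int t2 = *load_addr; // load t2 first
  *store_addr = t1;    // store t1, sunk below the load
  return t2;
}
```

With distinct addresses the two orders agree; with aliasing addresses the sunk version reads the stale value, so a memory-dependence check must guard the sink.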
Quuxplusone commented 5 years ago

I agree with Ayal in trying to detect this pattern in the vectoriser.

This can also come from over-optimised hand-written code, so making the vectoriser understand such patterns is desirable.

But the problem with the undef approach is that it may produce misaligned loads which, even if they don't trap, will still be slower than aligned loads, especially when repeated inside the loop.

A perhaps more invasive, but more profitable, transformation is, upon recognising that there are no read/write dependencies and no loop-carried dependencies, to make the init load [a0, a1, ..., an-1], provided you can guarantee that the loop will execute at least n-1 times.
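
One possible reading of this suggestion, sketched in C++ (assuming VF=4 and that a[0..3] are all known dereferenceable; the placeholder value and names are illustrative):

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>; // models a <4 x i32> vector

// Current approach: a[0] inserted into the last lane of an otherwise
// undef vector (undef lanes modeled here with a -1 placeholder).
Vec4 undef_init(int a0) { return {-1, -1, -1, a0}; }

// Suggested alternative: one full (and alignable) vector load of the
// first VF elements, legal only when a[0..3] are all dereferenceable
// and the loop runs enough iterations.
Vec4 full_init(const int *a) { return {a[0], a[1], a[2], a[3]}; }
```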

Quuxplusone commented 5 years ago
The "vinit = [undef:undef:undef:a[0]]" was shorthand for:

...
  %.pre = load i32, i32* %.phi.trans.insert, align 4 ...
...
vector.ph:
  %n.mod.vf = urem i64 %19, 4
  %n.vec = sub i64 %19, %n.mod.vf
  %ind.end = add i64 %.fca.0.load, %n.vec
  %vector.recur.init = insertelement <4 x i32> undef, i32 %.pre, i32 3
  br label %vector.body

There may be other reasons why it is worth preventing/undoing such
cross-iteration dependencies before vectorization, and applying the PRE
optimization afterwards on vectorized loops.
Quuxplusone commented 4 years ago

I've put up a patch to slightly generalize the checks in first order recurrence detection to fix a related issue (https://bugs.llvm.org/show_bug.cgi?id=43398). That should be easily extendable to support this case as well.

Quuxplusone commented 4 years ago

If you know how to easily extend it - a patch doing so or something showing how would be really appreciated! Thanks!