Loop vectorizer assertion failure in truncateToMinimalBitwidths

Quuxplusone commented 6 years ago


Bugzilla Link	PR38110
Status	RESOLVED FIXED
Importance	P normal
Reported by	Ulrich Weigand (uweigand@de.ibm.com)
Reported on	2018-07-09 11:41:10 -0700
Last modified on	2018-12-04 14:12:04 -0800
Version	trunk
Hardware	Other Linux
CC	ayal.zaks@intel.com, hfinkel@anl.gov, hideki.saito@intel.com, james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org, mario.held@de.ibm.com, mssimpso@codeaurora.org, paulsson@linux.vnet.ibm.com, silviu.baranga@arm.com
Fixed by commit(s)	rL337861
Attachments	`loopvect-bug.ll` (1548 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 20543
Reduced test case

Running the attached test case through

  opt -loop-vectorize -mcpu=z13

on the s390x-ibm-linux target results in an assertion failure:

Unhandled instruction type!
UNREACHABLE executed at /home/uweigand/llvm/llvm-
head/lib/Transforms/Vectorize/LoopVectorize.cpp:3292!

The unhandled instruction at this point seems to be a phi node.

This is a reduced test case from a real-world application.  The corresponding
reduced C source code test case (same assertion failure) is:

void
test (unsigned int width, unsigned char *row,
      unsigned short src, unsigned short *dst)
{
  unsigned char *sp;
  unsigned int i;

  sp = row;
  for (i = 0; i < width; i++, sp++)
    {
      if (*sp == src)
        *sp = (unsigned char) *dst;
    }
}

Quuxplusone commented 6 years ago

Attached loopvect-bug.ll (1548 bytes, text/plain): Reduced test case

Quuxplusone commented 6 years ago

Uli, Hideki, this goes back to Revision 326079.

Before that, LV would bail out early on this loop when targeting Z because its TTI->isLegalMaskedStore() returns false. Now (and since then) it tries to handle it, with a crash.

As a crude work-around one can use -enable-if-conversion=false

Quuxplusone commented 6 years ago

I don't get an assertion fail in my environment. Hmmm.

> opt -loop-vectorize -mcpu=z13 loopvect-bug.ll -S
; ModuleID = 'loopvect-bug.ll'
source_filename = "loopvect-bug.ll"
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64"
target triple = "s390x-ibm-linux"

define void @test(i32 zeroext %width, i8* nocapture %row, i16 zeroext %src,
i16* nocapture readonly %dst) #0 {
entry:
  %dst1 = bitcast i16* %dst to i8*
  %cmp10 = icmp eq i32 %width, 0
  br i1 %cmp10, label %for.end, label %for.body.lr.ph

for.body.lr.ph:                                   ; preds = %entry
  %conv1 = zext i16 %src to i32
  %0 = add i32 %width, -1
  %1 = zext i32 %0 to i64
  %2 = add i64 %1, 1
  %min.iters.check = icmp ult i64 %2, 2
  br i1 %min.iters.check, label %scalar.ph, label %vector.memcheck

vector.memcheck:                                  ; preds = %for.body.lr.ph
  %3 = add i32 %width, -1
  %4 = zext i32 %3 to i64
  %5 = add i64 %4, 1
  %scevgep = getelementptr i8, i8* %row, i64 %5
  %uglygep = getelementptr i8, i8* %dst1, i64 1
  %bc = bitcast i16* %dst to i8*
  %bound0 = icmp ult i8* %row, %uglygep
  %bound1 = icmp ult i8* %bc, %scevgep
  %found.conflict = and i1 %bound0, %bound1
  %memcheck.conflict = and i1 %found.conflict, true
  br i1 %memcheck.conflict, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.memcheck
  %n.mod.vf = urem i64 %2, 2
  %n.vec = sub i64 %2, %n.mod.vf
  %cast.crd = trunc i64 %n.vec to i32
  %ind.end = add i32 0, %cast.crd
  %ind.end3 = getelementptr i8, i8* %row, i64 %n.vec
  %broadcast.splatinsert5 = insertelement <2 x i32> undef, i32 %conv1, i32 0
  %broadcast.splat6 = shufflevector <2 x i32> %broadcast.splatinsert5, <2 x i32> undef, <2 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds =
%pred.store.continue10, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %pred.store.continue10 ]
  %6 = trunc i64 %index to i32
  %offset.idx = add i32 0, %6
  %broadcast.splatinsert = insertelement <2 x i32> undef, i32 %offset.idx, i32 0
  %broadcast.splat = shufflevector <2 x i32> %broadcast.splatinsert, <2 x i32> undef, <2 x i32> zeroinitializer
  %induction = add <2 x i32> %broadcast.splat, <i32 0, i32 1>
  %7 = add i32 %offset.idx, 0
  %8 = add i64 %index, 0
  %next.gep = getelementptr i8, i8* %row, i64 %8
  %9 = getelementptr i8, i8* %next.gep, i32 0
  %10 = bitcast i8* %9 to <2 x i8>*
  %wide.load = load <2 x i8>, <2 x i8>* %10, align 1, !alias.scope !0, !noalias !3
  %11 = zext <2 x i8> %wide.load to <2 x i32>
  %12 = icmp eq <2 x i32> %11, %broadcast.splat6
  %13 = extractelement <2 x i1> %12, i32 0
  br i1 %13, label %pred.load.if, label %pred.load.continue

pred.load.if:                                     ; preds = %vector.body
  %14 = load i16, i16* %dst, align 2, !alias.scope !3
  br label %pred.load.continue

pred.load.continue:                               ; preds = %pred.load.if,
%vector.body
  %15 = phi i16 [ undef, %vector.body ], [ %14, %pred.load.if ]
  %16 = extractelement <2 x i1> %12, i32 1
  br i1 %16, label %pred.load.if7, label %pred.load.continue8

pred.load.if7:                                    ; preds = %pred.load.continue
  %17 = load i16, i16* %dst, align 2, !alias.scope !3
  br label %pred.load.continue8

pred.load.continue8:                              ; preds = %pred.load.if7,
%pred.load.continue
  %18 = phi i16 [ undef, %pred.load.continue ], [ %17, %pred.load.if7 ]
  %19 = extractelement <2 x i1> %12, i32 0
  br i1 %19, label %pred.store.if, label %pred.store.continue

pred.store.if:                                    ; preds = %pred.load.continue8
  %20 = trunc i16 %15 to i8
  store i8 %20, i8* %next.gep, align 1, !alias.scope !0, !noalias !3
  br label %pred.store.continue

pred.store.continue:                              ; preds = %pred.store.if,
%pred.load.continue8
  %21 = extractelement <2 x i1> %12, i32 1
  br i1 %21, label %pred.store.if9, label %pred.store.continue10

pred.store.if9:                                   ; preds = %pred.store.continue
  %22 = trunc i16 %18 to i8
  %23 = add i64 %index, 1
  %next.gep4 = getelementptr i8, i8* %row, i64 %23
  store i8 %22, i8* %next.gep4, align 1, !alias.scope !0, !noalias !3
  br label %pred.store.continue10

pred.store.continue10:                            ; preds = %pred.store.if9,
%pred.store.continue
  %index.next = add i64 %index, 2
  %24 = icmp eq i64 %index.next, %n.vec
  br i1 %24, label %middle.block, label %vector.body, !llvm.loop !5

middle.block:                                     ; preds =
%pred.store.continue10
  %cmp.n = icmp eq i64 %2, %n.vec
  br i1 %cmp.n, label %for.end.loopexit, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block,
%vector.memcheck, %for.body.lr.ph
  %bc.resume.val = phi i32 [ %ind.end, %middle.block ], [ 0, %for.body.lr.ph ], [ 0, %vector.memcheck ]
  %bc.resume.val2 = phi i8* [ %ind.end3, %middle.block ], [ %row, %for.body.lr.ph ], [ %row, %vector.memcheck ]
  br label %for.body

for.body:                                         ; preds = %for.inc, %scalar.ph
  %i.012 = phi i32 [ %bc.resume.val, %scalar.ph ], [ %inc, %for.inc ]
  %sp.011 = phi i8* [ %bc.resume.val2, %scalar.ph ], [ %incdec.ptr, %for.inc ]
  %25 = load i8, i8* %sp.011, align 1
  %conv = zext i8 %25 to i32
  %cmp2 = icmp eq i32 %conv, %conv1
  br i1 %cmp2, label %if.then, label %for.inc

if.then:                                          ; preds = %for.body
  %26 = load i16, i16* %dst, align 2
  %conv4 = trunc i16 %26 to i8
  store i8 %conv4, i8* %sp.011, align 1
  br label %for.inc

for.inc:                                          ; preds = %if.then, %for.body
  %inc = add nuw i32 %i.012, 1
  %incdec.ptr = getelementptr inbounds i8, i8* %sp.011, i64 1
  %exitcond = icmp eq i32 %inc, %width
  br i1 %exitcond, label %for.end.loopexit, label %for.body, !llvm.loop !7

for.end.loopexit:                                 ; preds = %middle.block,
%for.inc
  br label %for.end

for.end:                                          ; preds = %for.end.loopexit,
%entry
  ret void
}

attributes #0 = { "target-cpu"="z13" }

!0 = !{!1}
!1 = distinct !{!1, !2}
!2 = distinct !{!2, !"LVerDomain"}
!3 = !{!4}
!4 = distinct !{!4, !2}
!5 = distinct !{!5, !6}
!6 = !{!"llvm.loop.isvectorized", i32 1}
!7 = distinct !{!7, !6}

Quuxplusone commented 6 years ago

I got it reproduced with the compiler built with make check-all.
Let me see.

Quuxplusone commented 6 years ago

Here's the patch. Will upload to Phabricator after more testing.
truncateToMinimalBitwidths() has a big hole, and that fact has nothing to do
with masked load/store change. In the final fix, I'll change unreachable
assertion into continue (i.e., no-op) and ask the original author to review it.

Index: LoopVectorize.cpp
===================================================================
--- LoopVectorize.cpp   (revision 337200)
+++ LoopVectorize.cpp   (working copy)
@@ -3276,7 +3276,7 @@
             SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1));

         NewI = B.CreateShuffleVector(O0, O1, SI->getMask());
-      } else if (isa<LoadInst>(I)) {
+      } else if (isa<LoadInst>(I) || isa<PHINode>(I)) {
         // Don't do anything with the operands, just extend the result.
         continue;
       } else if (auto *IE = dyn_cast<InsertElementInst>(I)) {

Quuxplusone commented 6 years ago

Committed the fix r337861.

https://reviews.llvm.org/D49461

Quuxplusone commented 6 years ago

OK, our original test case now works as well. Thanks!

Quuxplusone commented 5 years ago

> truncateToMinimalBitwidths() has a big hole, and that fact has nothing to do
with masked load/store change.

The original truncateToMinimalBitwidths() clearly was not expecting a phi node,
having treated it as unreachable; why should it, when
computeMinimumValueSizes() which builds MinBWs does not provide them:
"
    // We don't modify the types of PHIs. Reductions will already have been
    // truncated if possible, and inductions' sizes will have been chosen by
    // indvars.
    // If we are required to shrink a PHI, abandon this entire equivalence class.
"

What phi node did this testcase present to truncateToMinimalBitwidths()? It's a
rather rare phi node that getOrCreateVectorValue() returns for a scalarized and
predicated (rather than masked) load, which in turn feeds a vectorized trunc.

> In the final fix, I'll change unreachable assertion into continue (i.e., no-
op)

Truncation to minimal bitwidth is an additional optimization, which indeed
better be avoided when unexpected instructions are encountered, rather than
crashing. But when this occurs best leave a debug and/or ORE message behind
indicating that an optimization opportunity was potentially lost. Specifically
for such phi nodes, its better to truncate their operands (repeatedly), or at-
least leave a TODO behind. This is also generally related to @jonpa's D53865.

Quuxplusone / LLVMBugzillaTest

Loop vectorizer assertion failure in truncateToMinimalBitwidths #37083