Hang when generating LLVM code with Julia typo

jakobnissen commented 3 years ago

I found a typo which causes LLVM codegen to hang:

const v256 = NTuple{32, VecElement{UInt8}}
const ZERO_v256 = ntuple(i -> VecElement{UInt8}(0x00), 32)

@inline function vpcmpeqb(a::v256, b::v256)
    Base.lvmcall( """%res = icmp eq <32 x i8> %0, %1   # NB! NOTE TYPO IN "LVMCALL"
    %resb = sext <32 x i1> %res to <32 x i8>
    ret <32 x i8> %resb
    """, v256, Tuple{v256, v256}, a, b)
end

@inline function vpmovmskb(v::v256)
    eqzero = vpcmpeqb(v, ZERO_v256)
    packed = ccall("llvm.x86.avx2.pmovmskb", llvmcall, UInt32, (NTuple{32, VecElement{UInt8}},), eqzero)
    return leading_ones(packed)
end

vpmovmskb(ZERO_v256) # hangs

Correcting the lvmcall to llvmcall on a machine with the AVX2 instruction set causes the code to behave correctly. Doing @code_llvm vpmovmskb(ZERO_v256) or @code_native vpmovmskb(ZERO_v256) also hangs.

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = micro
  JULIA_NUM_THREADS = 8

maleadt commented 3 years ago

The hang reproduces with just:

ccall("llvm.x86.avx2.pmovmskb", llvmcall, UInt32, (NTuple{32, VecElement{UInt8}},), eqzero)

So the typo isn't related.

chriselrod commented 3 years ago

I get:

julia> v = ntuple(Core.VecElement{UInt8}, Val(32));

julia> ccall("llvm.x86.avx2.pmovmskb", llvmcall, UInt32, (NTuple{32, VecElement{UInt8}},), v)
LLVM ERROR: Instruction Combining seems stuck in an infinite loop after 1000 iterations.

signal (6): Aborted

Keno commented 3 years ago

Sounds like a monotonicity violation in LLVM's instcombine code for these intrinsics.

Keno commented 3 years ago

I think this may be a nice issue for somebody who wants to get their feet wet with LLVM, so here's how to get started with reproducing this on LLVM upstream, since it's fairly self contained. If nobody takes it within a couple of days, I'll get to it myself:

$ git clone https://github.com/llvm/llvm-project
$ mkdir llvm-project-build
$ cd llvm-project-build
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug ../llvm-project/llvm
$ cat test.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:10:11:12:13"
target triple = "x86_64-unknown-linux-gnu"

declare i32 @llvm.x86.avx2.pmovmskb(<32 x i8>)
declare {}*** @julia.ptls_states()
declare nonnull {}* @jl_apply_generic({}*, {}**, i32) #0
; Function Attrs: norecurse nounwind readnone
declare nonnull {}* @julia.typeof({}*) #1

define i32 @julia_f_179({}* nonnull readonly %0) {
top:
  %v = alloca {}*, align 8
  %1 = call {}*** @julia.ptls_states()
  store {}* null, {}** %v, align 8
  %2 = bitcast {}*** %1 to {}**
  %3 = getelementptr inbounds {}*, {}** %2, i64 4
  %4 = bitcast {}** %3 to i64**
  %5 = load i64*, i64** %4, align 8
  store {}* %0, {}** %v, align 8
  %6 = load {}*, {}** %v, align 8
  %7 = call cc37 nonnull {}* bitcast ({}* ({}*, {}**, i32)* @jl_apply_generic to {}* ({}*, {}*, {}*)*)({}* inttoptr (i64 140171615285360 to {}*), {}* inttoptr (i64 140171511894928 to {}*), {}* %6)
  %8 = call {}* @julia.typeof({}* %7)
  %9 = icmp eq {}* %8, inttoptr (i64 140171511894928 to {}*)
  %10 = zext i1 %9 to i8
  %11 = trunc i8 %10 to i1
  %12 = xor i1 %11, true
  br i1 %12, label %L6, label %L4

L4:                                               ; preds = %top
  %13 = bitcast {}* %7 to <32 x i8>*
  %14 = load <32 x i8>, <32 x i8>* %13, align 16
  br label %L8

L6:                                               ; preds = %top
  %15 = call cc37 nonnull {}* bitcast ({}* ({}*, {}**, i32)* @jl_apply_generic to {}* ({}*, {}*, {}*)*)({}* inttoptr (i64 140171616965680 to {}*), {}* inttoptr (i64 140171511894928 to {}*), {}* %7)
  %16 = bitcast {}* %15 to <32 x i8>*
  %17 = load <32 x i8>, <32 x i8>* %16, align 16
  br label %L8

L8:                                               ; preds = %L6, %L4
  %value_phi = phi <32 x i8> [ %14, %L4 ], [ %17, %L6 ]
  %18 = call i32 @llvm.x86.avx2.pmovmskb(<32 x i8> %value_phi) [ "jl_roots"({}* %7) ]
  ret i32 %18
}
$ ./bin/opt -print-after-all -instcombine ~/avx2bug/test.ll 
WARNING: You're attempting to print out a bitcode file.
This is inadvisable as it may cause display problems. If
you REALLY want to taste LLVM bitcode first-hand, you
can force output with the `-f' option.

LLVM ERROR: Instruction Combining seems stuck in an infinite loop after 100 iterations.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.  Program arguments: ./bin/opt -print-after-all -instcombine /home/keno/avx2bug/test.ll 
1.  Running pass 'Function Pass Manager' on module '/home/keno/avx2bug/test.ll'.
2.  Running pass 'Combine redundant instructions' on function '@julia_f_179'

Keno commented 3 years ago

(I previously posted an incorrect reproduction step, the comment is updated to reflect the actual reproducer)

Keno commented 3 years ago

The problem seems to be the roots bundle on the call:

; ModuleID = './bugpoint-reduced-simplified.bc'
source_filename = "/home/keno/avx2bug/test.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:10:11:12:13"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nounwind readnone
declare i32 @llvm.x86.avx2.pmovmskb(<32 x i8>) #0

declare nonnull {}* @jl_apply_generic({}*, {}**, i32)

define void @julia_f_179() {
top:
  %0 = call cc37 nonnull {}* bitcast ({}* ({}*, {}**, i32)* @jl_apply_generic to {}* ({}*, {}*, {}*)*)({}* inttoptr (i64 140171616965680 to {}*), {}* inttoptr (i64 140171511894928 to {}*), {}* undef)
  %1 = bitcast {}* %0 to <32 x i8>*
  %2 = load <32 x i8>, <32 x i8>* %1, align 16
  %3 = call i32 @llvm.x86.avx2.pmovmskb(<32 x i8> %2) [ "jl_roots"({}* undef) ]
  ret void
}

attributes #0 = { nounwind readnone }

fails, but if you remove the roots bundle, it goes through ok. Seems like it doesn't recognize that the IR is unchanged. Should be straightforward to fix. Who wants to take a crack?

chriselrod commented 3 years ago

I wouldn't mind working on this. I've built llvm and can reproduce the error disappearing when commenting out [ "jl_roots"({}* undef) ].

What is a "roots bundle"? I guess the approach to fixing this would be to first find out why it doesn't recognize that the IR is unchanged?

Keno commented 3 years ago

The roots bundle is custom metadata we insert in the frontend that tells our later GC lowering passes that a certain value needs to be preserved during a call (in this case the ccall is an intrinsic, but we don't know that during the frontend). So likely something in instcombine is trying to do something with the intrinsic, but bailing once it sees that operand bundle, but at that point it has already decided that it modified something. The thing to do would be to gdb (or rr - which is what I would do) opt and figure out where in instcombine it's trying to modify the intrinsic and why it thinks it modified it.

miguelraz commented 3 years ago

Bookmarking if @chriselrod wants to punt on this later on. Would love to dive into it later in the week though.

chriselrod commented 3 years ago

Sorry for taking so long to get back to this. This comment is verbose because I have no idea what I'm doing.

I'm on (probably a better way to do this?)

> git log | head -n1                                                                                       (base)
commit 47f784ace6bb43eb9d95277fcc847fb82abf0f7a

I talk about line numbers a lot, so the specific commit might help. I copy/paste the important functions too for good measure, but they don't have line numbers, so I do at least try and describe what part of the function they correspond to. Copy/pastes from rr/gdb of course show line numbers and the associated code on that line.

I used rr to record and replay ./bin/opt -print-after-all -instcombine ~/avx2bug/test.ll like you suggested. I set a breakpoint here:

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3894

which corresponds to the first line of the while (true) loop in this function:

static bool combineInstructionsOverFunction(
    Function &F, InstCombineWorklist &Worklist, AliasAnalysis *AA,
    AssumptionCache &AC, TargetLibraryInfo &TLI, TargetTransformInfo &TTI,
    DominatorTree &DT, OptimizationRemarkEmitter &ORE, BlockFrequencyInfo *BFI,
    ProfileSummaryInfo *PSI, unsigned MaxIterations, LoopInfo *LI) {
  auto &DL = F.getParent()->getDataLayout();
  MaxIterations = std::min(MaxIterations, LimitMaxIterations.getValue());

  /// Builder - This is an IRBuilder that automatically inserts new
  /// instructions into the worklist when they are created.
  IRBuilder<TargetFolder, IRBuilderCallbackInserter> Builder(
      F.getContext(), TargetFolder(DL),
      IRBuilderCallbackInserter([&Worklist, &AC](Instruction *I) {
        Worklist.add(I);
        if (match(I, m_Intrinsic<Intrinsic::assume>()))
          AC.registerAssumption(cast<CallInst>(I));
      }));

  // Lower dbg.declare intrinsics otherwise their value may be clobbered
  // by instcombiner.
  bool MadeIRChange = false;
  if (ShouldLowerDbgDeclare)
    MadeIRChange = LowerDbgDeclare(F);

  // Iterate while there is work to do.
  unsigned Iteration = 0;
  while (true) {
    ++NumWorklistIterations;
    ++Iteration;

    if (Iteration > InfiniteLoopDetectionThreshold) {
      report_fatal_error(
          "Instruction Combining seems stuck in an infinite loop after " +
          Twine(InfiniteLoopDetectionThreshold) + " iterations.");
    }

    if (Iteration > MaxIterations) {
      LLVM_DEBUG(dbgs() << "\n\n[IC] Iteration limit #" << MaxIterations
                        << " on " << F.getName()
                        << " reached; stopping before reaching a fixpoint\n");
      break;
    }

    LLVM_DEBUG(dbgs() << "\n\nINSTCOMBINE ITERATION #" << Iteration << " on "
                      << F.getName() << "\n");

    MadeIRChange |= prepareICWorklistFromFunction(F, DL, &TLI, Worklist);

    InstCombinerImpl IC(Worklist, Builder, F.hasMinSize(), AA, AC, TLI, TTI, DT,
                        ORE, BFI, PSI, DL, LI);
    IC.MaxArraySizeForCombine = MaxArraySize;

    if (!IC.run())
      break;

    MadeIRChange = true;
  }

  return MadeIRChange;
}

I.e., ++NumWorklistIterations;. Although when I press c, it brings me to line 3895 instead of 3894, but that doesn't really matter here.

Poking around a little:

(rr) p Iteration
$15 = 0
(rr) p MadeIRChange
$16 = false
(rr) p Worklist
$17 = (llvm::InstCombineWorklist &) @0x55f64c411f20: {
  Worklist = {<llvm::SmallVectorImpl<llvm::Instruction*>> = {<llvm::SmallVectorTemplateBase<llvm::Instruction*, true>> = {<llvm::SmallVectorTemplateCommon<llvm::Instruction*, void>> = {<llvm::SmallVectorBase<unsigned int>> = {BeginX = 0x55f64c411f30, Size = 0,
            Capacity = 256}, <No data fields>}, <No data fields>}, <No data fields>}, <llvm::SmallVectorStorage<llvm::Instruction*, 256>> = {
      InlineElts = "\000\000\000\000\000\000\000\000\220\026DL\366U", '\000' <repeats 26 times>, "\060*DL\366U\000\000\300\fAL\366U\000\000\320\033DL\366U", '\000' <repeats 18 times>, "\320\003DL\366U", '\000' <repeats 18 times>, "\300\354@L\366U\000\000`\345AL\366U\000\000\000\000\000\000\000\000\000\000@\307AL\366U\000\000\240\303@L\366U\000\000\000\000\000\000\000\000\000\000P\230@L\366U\000\000\340J>L\366U\000\000\000\261AL\366U\000\000\360\275@L\366U\000\000\000\000\000\000\000\000\000\000\240\341AL\366U\000\000"...}, <No data fields>},
  WorklistMap = {<llvm::DenseMapBase<llvm::DenseMap<llvm::Instruction*, unsigned int, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseMapPair<llvm::Instruction*, unsigned int> >, llvm::Instruction*, unsigned int, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseMapPair<llvm::Instruction*, unsigned int> >> = {<llvm::DebugEpochBase> = {
        Epoch = 0}, <No data fields>}, Buckets = 0x0, NumEntries = 0, NumTombstones = 0, NumBuckets = 0},
  Deferred = {<llvm::SetVector<llvm::Instruction*, llvm::SmallVector<llvm::Instruction*, 16>, llvm::SmallDenseSet<llvm::Instruction*, 16, llvm::DenseMapInfo<llvm::Instruction*> > >> = {
      set_ = {<llvm::detail::DenseSetImpl<llvm::Instruction*, llvm::SmallDenseMap<llvm::Instruction*, llvm::detail::DenseSetEmpty, 16, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseSetPair<llvm::Instruction*> >, llvm::DenseMapInfo<llvm::Instruction*> >> = {
          TheMap = {<llvm::DenseMapBase<llvm::SmallDenseMap<llvm::Instruction*, llvm::detail::DenseSetEmpty, 16, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseSetPair<llvm::Instruction*> >, llvm::Instruction*, llvm::detail::DenseSetEmpty, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseSetPair<llvm::Instruction*> >> = {<llvm::DebugEpochBase> = {
                Epoch = 0}, <No data fields>}, Small = 1, NumEntries = 0, NumTombstones = 0, storage = {
              buffer = "\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377\000\360\377\377\377\377\377\377"}}}, <No data fields>},
      vector_ = {<llvm::SmallVectorImpl<llvm::Instruction*>> = {<llvm::SmallVectorTemplateBase<llvm::Instruction*, true>> = {<llvm::SmallVectorTemplateCommon<llvm::Instruction*, void>> = {<llvm::SmallVectorBase<unsigned int>> = {BeginX = 0x55f64c4127f0, Size = 0,
                Capacity = 16}, <No data fields>}, <No data fields>}, <No data fields>}, <llvm::SmallVectorStorage<llvm::Instruction*, 16>> = {
          InlineElts = "\242O\aF\235-\031\365\000\000\000\000\237\067\361\374\246\065\246H\247\337^b", '\000' <repeats 16 times>, "\235\361\t뭇AZ\256s7\000\256\363\371\213\260-\323خ\037\223A\254\177\017Y\260W\231D\230IÂ\264\363\361\236\266\361\303/\267s\316\n\270IA\000\270\005\031+\000\000\000\000\254\265\230\367", '\000' <repeats 12 times>, "\222E\023\365\300\373\367E\235\305\320", <incomplete sequence \361>}, <No data fields>}}, <No data fields>}}

I have no idea what to make of the Worklist. I suppose C++ folk and Julia programmers can bond over excessively deep nested parametric types.

Note that it's the Iteration > InfiniteLoopDetectionThreshold check that fails. InfiniteLoopDetectionThreshold is some sort of class, and I can't actually evaluate Iteration > InfiniteLoopDetectionThreshold in the gdb REPL.

I can step forward a few times to get to the MadeIRChange |= prepareICWorklistFromFunction(F, DL, &TLI, Worklist) line:

(rr) s
1391      DataType getValue() const { return Value; }
(rr) s
3903        if (Iteration > MaxIterations) {
(rr) s
3910        LLVM_DEBUG(dbgs() << "\n\nINSTCOMBINE ITERATION #" << Iteration << " on "
(rr) s
3913        MadeIRChange |= prepareICWorklistFromFunction(F, DL, &TLI, Worklist);
(rr) p MadeIRChange
$18 = false

and it apparently returned false. MadeIRChange doesn't seem to be used directly for anything in this loop. It just gets for the sake of returning at the end of the function.

Now, when we step again, we enter the function prepareICWorklistFromFunction. I guess it's part of the InstCombinerImpl constructor. For reference, this is what the function looks like

/// Populate the IC worklist from a function, by walking it in depth-first
/// order and adding all reachable code to the worklist.
///
/// This has a couple of tricks to make the code faster and more powerful.  In
/// particular, we constant fold and DCE instructions as we go, to avoid adding
/// them to the worklist (this significantly speeds up instcombine on code where
/// many instructions are dead or constant).  Additionally, if we find a branch
/// whose condition is a known constant, we only visit the reachable successors.
static bool prepareICWorklistFromFunction(Function &F, const DataLayout &DL,
                                          const TargetLibraryInfo *TLI,
                                          InstCombineWorklist &ICWorklist) {
  bool MadeIRChange = false;
  SmallPtrSet<BasicBlock *, 32> Visited;
  SmallVector<BasicBlock*, 256> Worklist;
  Worklist.push_back(&F.front());

  SmallVector<Instruction*, 128> InstrsForInstCombineWorklist;
  DenseMap<Constant *, Constant *> FoldedConstants;

  do {
    BasicBlock *BB = Worklist.pop_back_val();

    // We have now visited this block!  If we've already been here, ignore it.
    if (!Visited.insert(BB).second)
      continue;

    for (BasicBlock::iterator BBI = BB->begin(), E = BB->end(); BBI != E; ) {
      Instruction *Inst = &*BBI++;

      // ConstantProp instruction if trivially constant.
      if (!Inst->use_empty() &&
          (Inst->getNumOperands() == 0 || isa<Constant>(Inst->getOperand(0))))
        if (Constant *C = ConstantFoldInstruction(Inst, DL, TLI)) {
          LLVM_DEBUG(dbgs() << "IC: ConstFold to: " << *C << " from: " << *Inst
                            << '\n');
          Inst->replaceAllUsesWith(C);
          ++NumConstProp;
          if (isInstructionTriviallyDead(Inst, TLI))
            Inst->eraseFromParent();
          MadeIRChange = true;
          continue;
        }

      // See if we can constant fold its operands.
      for (Use &U : Inst->operands()) {
        if (!isa<ConstantVector>(U) && !isa<ConstantExpr>(U))
          continue;

        auto *C = cast<Constant>(U);
        Constant *&FoldRes = FoldedConstants[C];
        if (!FoldRes)
          FoldRes = ConstantFoldConstant(C, DL, TLI);

        if (FoldRes != C) {
          LLVM_DEBUG(dbgs() << "IC: ConstFold operand of: " << *Inst
                            << "\n    Old = " << *C
                            << "\n    New = " << *FoldRes << '\n');
          U = FoldRes;
          MadeIRChange = true;
        }
      }

      // Skip processing debug intrinsics in InstCombine. Processing these call instructions
      // consumes non-trivial amount of time and provides no value for the optimization.
      if (!isa<DbgInfoIntrinsic>(Inst))
        InstrsForInstCombineWorklist.push_back(Inst);
    }

    // Recursively visit successors.  If this is a branch or switch on a
    // constant, only visit the reachable successor.
    Instruction *TI = BB->getTerminator();
    if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
      if (BI->isConditional() && isa<ConstantInt>(BI->getCondition())) {
        bool CondVal = cast<ConstantInt>(BI->getCondition())->getZExtValue();
        BasicBlock *ReachableBB = BI->getSuccessor(!CondVal);
        Worklist.push_back(ReachableBB);
        continue;
      }
    } else if (SwitchInst *SI = dyn_cast<SwitchInst>(TI)) {
      if (ConstantInt *Cond = dyn_cast<ConstantInt>(SI->getCondition())) {
        Worklist.push_back(SI->findCaseValue(Cond)->getCaseSuccessor());
        continue;
      }
    }

    for (BasicBlock *SuccBB : successors(TI))
      Worklist.push_back(SuccBB);
  } while (!Worklist.empty());

  // Remove instructions inside unreachable blocks. This prevents the
  // instcombine code from having to deal with some bad special cases, and
  // reduces use counts of instructions.
  for (BasicBlock &BB : F) {
    if (Visited.count(&BB))
      continue;

    unsigned NumDeadInstInBB;
    unsigned NumDeadDbgInstInBB;
    std::tie(NumDeadInstInBB, NumDeadDbgInstInBB) =
        removeAllNonTerminatorAndEHPadInstructions(&BB);

    MadeIRChange |= NumDeadInstInBB + NumDeadDbgInstInBB > 0;
    NumDeadInst += NumDeadInstInBB;
  }

  // Once we've found all of the instructions to add to instcombine's worklist,
  // add them in reverse order.  This way instcombine will visit from the top
  // of the function down.  This jives well with the way that it adds all uses
  // of instructions to the worklist after doing a transformation, thus avoiding
  // some N^2 behavior in pathological cases.
  ICWorklist.reserve(InstrsForInstCombineWorklist.size());
  for (Instruction *Inst : reverse(InstrsForInstCombineWorklist)) {
    // DCE instruction if trivially dead. As we iterate in reverse program
    // order here, we will clean up whole chains of dead instructions.
    if (isInstructionTriviallyDead(Inst, TLI)) {
      ++NumDeadInst;
      LLVM_DEBUG(dbgs() << "IC: DCE: " << *Inst << '\n');
      salvageDebugInfo(*Inst);
      Inst->eraseFromParent();
      MadeIRChange = true;
      continue;
    }

    ICWorklist.push(Inst);
  }

  return MadeIRChange;
}

That function also has a bool MadeIRChange = false variable, but we can't actually print it

(rr) p MadeIRChange
$20 = <optimized out>

I set a series of extra break points, start of do { } while (!Worklist.empty()); block

 b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3759

The ++NumConstProp; line, see if MadeIRChange triggers here.

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3774

The U = FoldRes; line, triggered if FodRes != C and also sets MadeIRChange = true.

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3795

The end of the loop.

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3826

Now, continuing again, I see that MadeIRChange does exist once we enter the loop.

(rr) c
Continuing.

Breakpoint 8, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3764
3764        for (BasicBlock::iterator BBI = BB->begin(), E = BB->end(); BBI != E; ) {
(rr) p MadeIRChange
$21 = false

But I'm not really sure how to read the basic block

$23 = {<llvm::Value> = {VTy = 0x55f64c404d68, UseList = 0x0, SubclassID = 20 '\024', HasValueHandle = 0 '\000', SubclassOptionalData = 0 '\000', SubclassData = 0, NumUserOperands = 0,
    IsUsedByMD = 0, HasName = 1, HasMetadata = 0, HasHungOffUses = 0, HasDescriptor = 0, static MaxAlignmentExponent = 29,
    static MaximumAlignment = 536870912}, <llvm::ilist_node_with_parent<llvm::BasicBlock, llvm::Function>> = {<llvm::ilist_node<llvm::BasicBlock>> = {<llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::BasicBlock, true, false, void> >> = {<llvm::ilist_node_base<true>> = {PrevAndSentinel = {Value = 94516329818528},
          Next = 0x55f64c43c2f8}, <No data fields>}, <No data fields>}, <No data fields>},
  InstList = {<llvm::iplist_impl<llvm::simple_ilist<llvm::Instruction>, llvm::SymbolTableListTraits<llvm::Instruction> >> = {<llvm::SymbolTableListTraits<llvm::Instruction>> = {<llvm::ilist_alloc_traits<llvm::Instruction>> = {<No data fields>}, <No data fields>}, <llvm::simple_ilist<llvm::Instruction>> = {<llvm::ilist_base<true>> = {<No data fields>}, <llvm::ilist_detail::SpecificNodeAccess<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void> >> = {<llvm::ilist_detail::NodeAccess> = {<No data fields>}, <No data fields>},
        Sentinel = {<llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void> >> = {<llvm::ilist_node_base<true>> = {PrevAndSentinel = {
                Value = 94516329825292}, Next = 0x55f64c43abf8}, <No data fields>}, <No data fields>}}, <No data fields>}, <No data fields>}, Parent = 0x55f64c43a958}

At least it doesn't contain buffers like the Worklist earlier. Let's move on. I continue a few times, printing the Worklist, mostly just taking a look at MadeIRChange (to make sure I didn't miss it turning true) and Worklist to see if Size = ... changes

(rr) p MadeIRChange
$24 = false
(rr) p Worklist
$25 = {<llvm::SmallVectorImpl<llvm::BasicBlock*>> = {<llvm::SmallVectorTemplateBase<llvm::BasicBlock*, true>> = {<llvm::SmallVectorTemplateCommon<llvm::BasicBlock*, void>> = {<llvm::SmallVectorBase<unsigned int>> = {BeginX = 0x7ffc5e6b8130, Size = 1, Capacity = 256}, <No data fields>}, <No data fields>}, <No data fields>}, <llvm::SmallVectorStorage<llvm::BasicBlock*, 256>> = {
    InlineElts = "\000\272CL\366U\000\000\340\302CL\366U\000\000hxEL\366U\000\000]\320PU!\177\000\000\004\000\000\000\000\000\000\000\260\201k^\374\177\000\000\b`GL\366U\000\000\311\365\207U!\177", '\000' <repeats 11 times>, "^\377pn\034\246\305@\203k^\374\177\000\000\002\000\000\000\000\000\000\000P\204k^\374\177\000\000\200\203k^\374\177\000\000@\203k^\374\177\000\000\060\000\000\000\000\000\000\000hxEL\366U\000\000\252V\327H\366U\000\000\000\000\000\000\000\000\000\000ȕEL\366U\000\000P\201EL\366U\000\000\030\000\000\000\000\000\000\000\240\344\374F\366U\000\000p\344\374F\366U\000\000 \000\000\000\000\000\000\000 "...}, <No data fields>}
(rr) c
Continuing.

Breakpoint 8, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3764
3764        for (BasicBlock::iterator BBI = BB->begin(), E = BB->end(); BBI != E; ) {
(rr) p Worklist
$26 = {<llvm::SmallVectorImpl<llvm::BasicBlock*>> = {<llvm::SmallVectorTemplateBase<llvm::BasicBlock*, true>> = {<llvm::SmallVectorTemplateCommon<llvm::BasicBlock*, void>> = {<llvm::SmallVectorBase<unsigned int>> = {BeginX = 0x7ffc5e6b8130, Size = 1, Capacity = 256}, <No data fields>}, <No data fields>}, <No data fields>}, <llvm::SmallVectorStorage<llvm::BasicBlock*, 256>> = {
    InlineElts = "\000\272CL\366U\000\000 \305CL\366U\000\000hxEL\366U\000\000]\320PU!\177\000\000\004\000\000\000\000\000\000\000\260\201k^\374\177\000\000\b`GL\366U\000\000\311\365\207U!\177", '\000' <repeats 11 times>, "^\377pn\034\246\305@\203k^\374\177\000\000\002\000\000\000\000\000\000\000P\204k^\374\177\000\000\200\203k^\374\177\000\000@\203k^\374\177\000\000\060\000\000\000\000\000\000\000hxEL\366U\000\000\252V\327H\366U\000\000\000\000\000\000\000\000\000\000ȕEL\366U\000\000P\201EL\366U\000\000\030\000\000\000\000\000\000\000\240\344\374F\366U\000\000p\344\374F\366U\000\000 \000\000\000\000\000\000\000 "...}, <No data fields>}
(rr) c
Continuing.

Breakpoint 8, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3764
3764        for (BasicBlock::iterator BBI = BB->begin(), E = BB->end(); BBI != E; ) {
(rr) p MadeIRChange
$27 = false
(rr) p Worklist
$28 = {<llvm::SmallVectorImpl<llvm::BasicBlock*>> = {<llvm::SmallVectorTemplateBase<llvm::BasicBlock*, true>> = {<llvm::SmallVectorTemplateCommon<llvm::BasicBlock*, void>> = {<llvm::SmallVectorBase<unsigned int>> = {BeginX = 0x7ffc5e6b8130, Size = 0, Capacity = 256}, <No data fields>}, <No data fields>}, <No data fields>}, <llvm::SmallVectorStorage<llvm::BasicBlock*, 256>> = {
    InlineElts = "\000\272CL\366U\000\000 \305CL\366U\000\000hxEL\366U\000\000]\320PU!\177\000\000\004\000\000\000\000\000\000\000\260\201k^\374\177\000\000\b`GL\366U\000\000\311\365\207U!\177", '\000' <repeats 11 times>, "^\377pn\034\246\305@\203k^\374\177\000\000\002\000\000\000\000\000\000\000P\204k^\374\177\000\000\200\203k^\374\177\000\000@\203k^\374\177\000\000\060\000\000\000\000\000\000\000hxEL\366U\000\000\252V\327H\366U\000\000\000\000\000\000\000\000\000\000ȕEL\366U\000\000P\201EL\366U\000\000\030\000\000\000\000\000\000\000\240\344\374F\366U\000\000p\344\374F\366U\000\000 \000\000\000\000\000\000\000 "...}, <No data fields>}

There're a few Worklist.push_backs at the end of the loop (one of them itself in another loop) to match the Worklist.pop_back_val() at the top, so the Size of the worklist could change in either direction or not at all from one iteration to the next.

continuing...

(rr) c
Continuing.

Breakpoint 11, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3830
3830      for (BasicBlock &BB : F) {
(rr) p MadeIRChange
$29 = false

We made it out of the loop without any MadeIRChanges.

Time for a few new breakpoints. First, in the next loop to see if NumDeadInstInBB + NumDeadDbgInstInBB > 0 sets MadeIRChange to true

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3836

another break after that loop

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3842

a breakpoint in the loop that follows, on the line ++NumDeadInst;

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3853

and finally, another breakpoint at the end of the loop, before the function returns MadeIRChange, so I can refill my 3am coffee and figure out what to do next.

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3863

continuing, we skip the first loop, but end up inside the if statement of the second. Probably should have looked at F first to try and figure out if it doesn't iterate or continues every iteration to have some grasp on what's actually happening. I guess that's what reverse-cont is for? Haven't tried it yet.

The second increments NumDeadInst:

(rr) c
Continuing.

Breakpoint 14, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3854
3854          LLVM_DEBUG(dbgs() << "IC: DCE: " << *Inst << '\n');
(rr) p MadeIRChange
$31 = false
(rr) p NumDeadInst
$32 = {<llvm::StatisticBase> = {DebugType = 0x55f649756db0 "instcombine", Name = 0x55f649b60cc3 "NumDeadInst", Desc = 0x55f649b60a08 "Number of dead inst eliminated"},
  Value = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, <No data fields>}, Initialized = {_M_base = {static _S_alignment = 1, _M_i = true}}}

Of course, MadeIRChange gets set true at the end of the block, so it's still false as of line 3854, where we broke. Trying to look at Inst

(rr) p *Inst
$34 = {<llvm::User> = {<llvm::Value> = {VTy = 0x55f64c3e8290, UseList = 0x0, SubclassID = 58 ':', HasValueHandle = 0 '\000', SubclassOptionalData = 0 '\000', SubclassData = 6,
      NumUserOperands = 1, IsUsedByMD = 0, HasName = 0, HasMetadata = 0, HasHungOffUses = 0, HasDescriptor = 0, static MaxAlignmentExponent = 29,
      static MaximumAlignment = 536870912}, <No data fields>}, <llvm::ilist_node_with_parent<llvm::Instruction, llvm::BasicBlock>> = {<llvm::ilist_node<llvm::Instruction>> = {<llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void> >> = {<llvm::ilist_node_base<true>> = {PrevAndSentinel = {Value = 94516329822664},
          Next = 0x55f64c43bb18}, <No data fields>}, <No data fields>}, <No data fields>}, Parent = 0x55f64c43aa90, DbgLoc = {Loc = {Ref = {MD = 0x0}}}, Order = 0}

I see HasMetadata = 0, so maybe this isn't the instruction that has the roots bundle? Whichever instruction Inst is, it seems it's "trivially dead".

(rr) p isInstructionTriviallyDead(Inst, TLI)
$35 = true
(rr) p TLI
$36 = (const llvm::TargetLibraryInfo *) 0x55f64c3e62d8
(rr) p *TLI
$37 = {Impl = 0x55f64c3e6200, OverrideAsUnavailable = {Bits = {<llvm::ArrayRef<unsigned long>> = {Data = 0x55f64c44bc50, Length = 8}, <No data fields>}, Size = 460}}

Continuing again, we end up back on the same line. MadeIRChange is now true, because it was set last time. The new Inst has a different SubclassID and a different VTy pointer, but the Parent and Next pointers are the same. I guess Next is part of a

(rr) c
Continuing.

Breakpoint 14, prepareICWorklistFromFunction (F=..., DL=..., TLI=TLI@entry=0x55f64c3e62d8, ICWorklist=...)
    at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3854
3854          LLVM_DEBUG(dbgs() << "IC: DCE: " << *Inst << '\n');
(rr) p *Inst
$38 = {<llvm::User> = {<llvm::Value> = {VTy = 0x55f64c3e82b0, UseList = 0x0, SubclassID = 75 'K', HasValueHandle = 0 '\000', SubclassOptionalData = 0 '\000', SubclassData = 0,
      NumUserOperands = 1, IsUsedByMD = 0, HasName = 0, HasMetadata = 0, HasHungOffUses = 0, HasDescriptor = 0, static MaxAlignmentExponent = 29,
      static MaximumAlignment = 536870912}, <No data fields>}, <llvm::ilist_node_with_parent<llvm::Instruction, llvm::BasicBlock>> = {<llvm::ilist_node<llvm::Instruction>> = {<llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void> >> = {<llvm::ilist_node_base<true>> = {PrevAndSentinel = {Value = 94516329822536},
          Next = 0x55f64c43bb18}, <No data fields>}, <No data fields>}, <No data fields>}, Parent = 0x55f64c43aa90, DbgLoc = {Loc = {Ref = {MD = 0x0}}}, Order = 0}
(rr) p MadeIRChange
$39 = true

I end up at that breakpoint 4 times, and the Inst always HasMetadata = 0.

Continuing again, we end up at the top of the loop in combineInstructionsOverFunction. I set a breakpoint at

b /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3918

which is just before the

if (!IC.run())
  break;

Maybe combineInstructionsOverFunction gets called multiple times, only eventually entering a state where it never terminates. Would be nice to know. continuing... Here, we go through prepareICWorklistFromFunction relatively quickly. The first loop popping off the Worklist iterates a few times, but we don't stop in any of the later loops. We end up at line 3919, where

3919        if (!IC.run())
(rr) p MadeIRChange
value has been optimized out
(rr) p IC.run()
No symbol "IC" in current context.

for some reason IC isn't defined, and MadeIRChange is reported as optimized out. Huh. If we continue again, we're back at the top of the loop, with the Iteration count increasing again, so !IC.run() must've been false (i.e., IC.run() must've been true). I assume that means instruction combine ran?

Breakpoint 7, combineInstructionsOverFunction (F=..., Worklist=..., AA=AA@entry=0x55f64c44d530, AC=..., TLI=..., TTI=..., DT=..., ORE=..., BFI=0x0, PSI=0x55f64c412880, MaxIterations=1000,
    LI=0x55f64c41fe90) at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3895
3895        ++Iteration;
(rr) p Iteration
$67 = 2

From here, I can keep continuing through, and see that nothing appears to be changing. The Worklist is always 4 iterations, and none of the other loops seem to trigger. !IC.run() seemingly never triggers, but I'm incapable of looking at it because IC apparently doesn't exist.

I kept hitting c until I got to

Breakpoint 7, combineInstructionsOverFunction (F=..., Worklist=..., AA=AA@entry=0x55f64c44d530, AC=..., TLI=..., TTI=..., DT=..., ORE=..., BFI=0x0, PSI=0x55f64c412880, MaxIterations=1000,
    LI=0x55f64c41fe90) at /home/chriselrod/Documents/languages/llvm/llvm-project/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp:3895
3895        ++Iteration;
(rr) p Iteration
$78 = 13

So I'm fairly confident that I am in fact in the function call that, were I to keep going, it'd reach 100 and trigger the reported error.

So, now, the next thing to do would be to figure more about what IC is, and why IC.run() is always true?

preames commented 2 years ago

Took a look at what's going on in the reduced IR example that Keno shared. Here's a quick summary of what's causing the infinite loop.

First, here's a slightly more reduced example which illustrates the issue:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:10:11:12:13"
target triple = "x86_64-unknown-linux-gnu"

declare i32 @llvm.x86.avx2.pmovmskb(<32 x i8>) nounwind readnone

define void @julia_f_179(<32 x i8> %a) {
  %v = call i32 @llvm.x86.avx2.pmovmskb(<32 x i8> %a) [ "jl_roots"({}* undef) ]
  ret void
}

$ opt -S -instcombine < example.ll

This does reproduce on tip of tree LLVM.

simplifyX86movmsk visits the intrinsic call and generates non-intrinsic IR which computes the same result. This new result is used to replace all uses of the original instruction. What's expected to happen next is that the intrinsic will be dead (e.g. have no uses and no side effects), and be deleted on the next iteration of instcombine.

However, as you'll note there are no uses of the intrinsic in the example. As a result, what happens instead is that the newly generated canonical IR is dead, it's deleted on the next iteration, and we're back to our starting point. Since both steps "made progress", we have an infinite loop.

The key question is why the intrinsic call isn't simply being deleted. My guess is that the operand bundle is causing us to consider the intrinsic to have side effects, but I haven't located that point in the code just yet. The call chain to examine starts with isInstructionTriviallyDead and I'm starting there now.

preames commented 2 years ago

Here's a summary of what's going on.

The unknown-to-llvm operand bundle on the call can potentially write to memory, and thus is correctly preventing the removal of the call. This points to the need to inhibit the canonicalization transform when there are no uses.

This would seem like an easy fix, but I suspect we'd be fixing one symptom of many. In general, glancing at how LLVM works with intrinsics, the interactions with operand bundles look suspect, and writable operand bundles particularly so. I strongly suspect we could reproduce this basic issue with most any target specific intrinsic.

Not knowing much about the Julia semantics of the operand bundle in question, I checked to see if we had the same problem with operand bundles which only read. It turns out we do, but do to a separate root cause which is a much more clearly complete fix.

If we can get away with marking this operand bundle as readonly, I think we should. It would unblock the Julia use case, and aligns with one of the most tested operand bundles in upstream llvm (deopt). There's still room for uncaught bugs even with readonly operand bundles, but the prevalence is likely to be quite a bit lower.

Now to go figure out if it's reasonable to mark the "jl-roots" bundle readonly...

Keno commented 2 years ago

jl_roots is used to temporarily GC root values and protect them from collection during the callee. It does not otherwise read or write the pointers passed to it, so I think it's correct to mark it readnone. See documentation on GC root placement here: https://github.com/JuliaLang/julia/blob/master/doc/src/devdocs/llvm.md#gc-root-placement

preames commented 2 years ago

@Keno Thanks! Had stumbled across that page the other day, but hadn't read it in detail yet. Was having trouble finding it again, so you just saved me a bunch of time.

preames commented 2 years ago

Ok, after investigating the use of jl_roots, and talking to Keno, I'm pretty sure that marking that bundle as readnone is valid.

I've attached two patches (as txt files, because github rejects .diff and .patch) one for readonly and one for readnone. I recommend using the readonly patch because it's slightly less risky.

The risk I see with the readnone patch comes down to lack of knowledge on my part about how julia models GC objects. If the frontend were to lower a GC allocation as:

  %alloc = call i8* @some_uninit_allocation_fn()
  call void @memset(i8* %alloc, i8 0, i64 %len)

Then an otherwise dead allocation might have the memset removed. By leaving the jl_roots bundle as reading, we prevent the optimizer from recognizing this.

If instead, julia lowers to a single initializing allocation call (e.g. calloc like), then using readnone is probably fine.

For clarity, the reason this matters is that most GC implementations I've run across get very confused (and generally crash horribly) if encountering uninitialized memory in a place they expect to find a valid reference.

I've been able to test these locally on opt with the reduced IR examples, but have not done end to end testing for julia.

Reminder: These depend on applying the changes which fix llvm/llvm-project#53270. Without those, opt will still crash.

Separately, I recommend that we change the frontend to not emit operand bundles on calls to LLVM target intrinsics. What the semantics of an operand bundle on an intrinsic is, is currently a tad unclear. There's a strong possibility that we change the langref to disallow these eventually, and in the meantime, removing them is a useful defense in depth fix for the reported crash.

JuliaRootsReadNone.txt JuliaRootsReadOnly.txt

giordano commented 2 months ago

Is this still relevant? I can't reproduce on 5230d27de95 (with LLVM 18) any of the issues reported at the top of this ticket:

julia> const ZERO_v256 = ntuple(i -> VecElement{UInt8}(0x00), 32);^C

julia> const v256 = NTuple{32, VecElement{UInt8}}
NTuple{32, VecElement{UInt8}}

julia> const v256 = NTuple{32, VecElement{UInt8}};

julia> const ZERO_v256 = ntuple(i -> VecElement{UInt8}(0x00), 32);

julia> @inline function vpcmpeqb(a::v256, b::v256)
           Base.lvmcall( """%res = icmp eq <32 x i8> %0, %1   # NB! NOTE TYPO IN "LVMCALL"
           %resb = sext <32 x i1> %res to <32 x i8>
           ret <32 x i8> %resb
           """, v256, Tuple{v256, v256}, a, b)
       end
vpcmpeqb (generic function with 1 method)

julia> @inline function vpmovmskb(v::v256)
           eqzero = vpcmpeqb(v, ZERO_v256)
           packed = ccall("llvm.x86.avx2.pmovmskb", llvmcall, UInt32, (NTuple{32, VecElement{UInt8}},), eqzero)
           return leading_ones(packed)
       end
vpmovmskb (generic function with 1 method)

julia> vpmovmskb(ZERO_v256) # this now errors out without hanging
ERROR: UndefVarError: `lvmcall` not defined in `Base`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] vpcmpeqb
   @ ./REPL[4]:2 [inlined]
 [2] vpmovmskb(v::NTuple{32, VecElement{UInt8}})
   @ Main ./REPL[5]:2
 [3] top-level scope
   @ REPL[6]:1

julia> v = ntuple(Core.VecElement{UInt8}, Val(32));

julia> ccall("llvm.x86.avx2.pmovmskb", llvmcall, UInt32, (NTuple{32, VecElement{UInt8}},), v)
0x00000000

JuliaLang / julia

Hang when generating LLVM code with Julia typo #38226