Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Long execution time for large IR module #22629

Open Quuxplusone opened 9 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR22630
Status NEW
Importance P normal
Reported by David Jones (david.jones@metrics.ca)
Reported on 2015-02-18 20:08:38 -0800
Last modified on 2015-02-19 17:00:45 -0800
Version 3.5
Hardware PC Linux
CC david.majnemer@gmail.com, dexonsmith@apple.com, llvm-bugs@lists.llvm.org, rafael@espindo.la, rnk@google.com
Fixed by commit(s)
Attachments timing (13218 bytes, application/octet-stream)
Blocks
Blocked by
See also
Created attachment 13903
Timing report

I have a rather large (71MB) IR assembly file (includes debug metadata) that
takes 100 seconds to optimize and generate x86-64 code for. (Host is AMD K10
running at 2.7-3.3GHz.)

If I run:

/tools/llvm/3.5.1dbg/bin/llc -O2  LMtb_ethernet.ll -o opt.o -time-passes

the pass timing report (attached) accounts for about 40 seconds of run time,
even though the command takes about 100 seconds.

In comparison, clang takes only 4-6 seconds to compile my largest C++ file,
which results in 61MB of LLVM IR assembly (again, including debug).

I'm not sure what's actually taking up the time. If I run opt instead of llc,
opt will finish in about 40 seconds.  Is this a case of LLVM backend behaving
reasonably on very large output, or is there something pathological in either
my code or LLVM that one of us can fix?

It should be noted that the file I am trying to compile is by no means
considered "large" in my industry.
Quuxplusone commented 9 years ago

Attached timing (13218 bytes, application/octet-stream): Timing report

Quuxplusone commented 9 years ago

Hmm... it appears that the bitcode is too large to attach here. Even assembled to bitcode and gzipped, it's 4.6MB.

Is there a way I could share it, e.g. via DropBox? I don't have a convenient web server to post it to.

Quuxplusone commented 9 years ago

DropBox would be fine.

Quuxplusone commented 9 years ago

Duncan recently fixed a metadata reading performance issue.

Quuxplusone commented 9 years ago

Metadata isn't the culprit here.

I locally regenerated the IR without any debug metadata (I can just flip a switch in the compiler.)

It saves about 10 seconds off the 1:40 run time.

Quuxplusone commented 9 years ago

What is the time of

opt -disable-output -disable-verify test.bc

?

Quuxplusone commented 9 years ago

About one second.

BTW, bitcode parsing is not the issue either. "In-system", the IR is built in memory using the C++ API, and optimization/code generation is done from that.

Quuxplusone commented 9 years ago

Can you profile llc with perf or some other tool and give the top offender? We have a couple issues open for backend pathological cases that we could dupe this against, like PR17409 and PR21349.

Quuxplusone commented 9 years ago
With debug metadata:

+  23.07%  llc  libLLVM-3.5.so       [.]
llvm::SlotTracker::CreateMetadataSlot(llvm::MDNode const*)
+  22.87%  llc  libLLVM-3.5.so       [.] llvm::SlotTracker::processModule()
+   9.80%  llc  libLLVM-3.5.so       [.] llvm::DenseMap<llvm::Value const*,
unsigned int, llvm::Dens
+   4.81%  llc  libLLVM-3.5.so       [.] llvm::DenseMap<llvm::MDNode const*,
unsigned int, llvm::Den
+   3.10%  llc  libLLVM-3.5.so       [.] llvm::MDNode::getOperand(unsigned int)
const
+   2.97%  llc  libLLVM-3.5.so       [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
+   2.09%  llc  libLLVM-3.5.so       [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) con
+   0.95%  llc  libLLVM-3.5.so       [.] _ZNK4llvm6MDNode10getOperandEj@plt
+   0.87%  llc  libc-2.19.so         [.] _int_malloc
+   0.86%  llc  libLLVM-3.5.so       [.]
llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVari
+   0.79%  llc  [kernel.kallsyms]    [k] 0xffffffff8104f45a
+   0.66%  llc  libLLVM-3.5.so       [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+   0.63%  llc  libLLVM-3.5.so       [.] llvm::AttributeSet::getFnAttributes()
const
+   0.61%  llc  libLLVM-3.5.so       [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::MDNode const*, unsi
+   0.58%  llc  libLLVM-3.5.so       [.] llvm::LLLexer::LexIdentifier()
+   0.51%  llc  libLLVM-3.5.so       [.]
llvm::LiveRange::createDeadDef(llvm::SlotIndex, llvm::BumpP
+   0.46%  llc  libc-2.19.so         [.] _int_free
+   0.41%  llc  libLLVM-3.5.so       [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segmen
+   0.41%  llc  libLLVM-3.5.so       [.] llvm::LLLexer::LexToken()
+   0.40%  llc  libLLVM-3.5.so       [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, uns
+   0.32%  llc  libc-2.19.so         [.] malloc_consolidate
+   0.24%  llc  libLLVM-3.5.so       [.]
llvm::SmallPtrSetImplBase::erase_imp(void const*)
+   0.24%  llc  libc-2.19.so         [.] memset
+   0.23%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::DAGCombiner::visit(llvm::SDNode*)
+   0.22%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::JoinVals::computeAssignment(unsigned
+   0.22%  llc  libLLVM-3.5.so       [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*
+   0.21%  llc  libLLVM-3.5.so       [.]
llvm::X86InstrInfo::AnalyzeBranch(llvm::MachineBasicBlock&,
+   0.21%  llc  libc-2.19.so         [.] malloc
+   0.20%  llc  libLLVM-3.5.so       [.]
llvm::TargetLowering::SimplifyDemandedBits(llvm::SDValue, l

Without debug metadata:

+  29.77%  llc  libLLVM-3.5.so       [.] llvm::SlotTracker::processModule()
+  16.39%  llc  libLLVM-3.5.so       [.] llvm::DenseMap<llvm::Value const*,
unsigned int, llvm::Dens
+   4.23%  llc  libLLVM-3.5.so       [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
+   3.65%  llc  libLLVM-3.5.so       [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) con
+   1.35%  llc  libLLVM-3.5.so       [.]
llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVari
+   1.15%  llc  libLLVM-3.5.so       [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+   1.11%  llc  libc-2.19.so         [.] _int_malloc
+   0.96%  llc  libLLVM-3.5.so       [.] llvm::LLLexer::LexIdentifier()
+   0.90%  llc  libLLVM-3.5.so       [.]
llvm::LiveRange::createDeadDef(llvm::SlotIndex, llvm::BumpP
+   0.84%  llc  [kernel.kallsyms]    [k] 0xffffffff8104f45a
+   0.81%  llc  libLLVM-3.5.so       [.] llvm::AttributeSet::getFnAttributes()
const
+   0.73%  llc  libLLVM-3.5.so       [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, uns
+   0.72%  llc  libLLVM-3.5.so       [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segmen
+   0.60%  llc  libc-2.19.so         [.] _int_free
+   0.56%  llc  libc-2.19.so         [.] malloc_consolidate
+   0.52%  llc  libLLVM-3.5.so       [.] llvm::LLLexer::LexToken()
+   0.40%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::DAGCombiner::visit(llvm::SDNode*)
+   0.39%  llc  libLLVM-3.5.so       [.]
llvm::X86InstrInfo::AnalyzeBranch(llvm::MachineBasicBlock&,
+   0.38%  llc  libLLVM-3.5.so       [.]
llvm::SmallPtrSetImplBase::erase_imp(void const*)
+   0.38%  llc  libLLVM-3.5.so       [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*
+   0.38%  llc  libLLVM-3.5.so       [.]
llvm::TargetLowering::SimplifyDemandedBits(llvm::SDValue, l
+   0.37%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::JoinVals::computeAssignment(unsigned
+   0.36%  llc  libc-2.19.so         [.] memset
+   0.31%  llc  libLLVM-3.5.so       [.]
llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm
+   0.30%  llc  libLLVM-3.5.so       [.]
llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4
+   0.29%  llc  libc-2.19.so         [.] malloc
+   0.29%  llc  libLLVM-3.5.so       [.] llvm::sys::MemoryFence()
+   0.29%  llc  libLLVM-3.5.so       [.]
llvm::MachineInstr::addRegisterKilled(unsigned int, llvm::T
+   0.28%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::DAGCombiner::visitSTORE(llvm::SDNode
+   0.26%  llc  libLLVM-3.5.so       [.] llvm::LiveRange::find(llvm::SlotIndex)
+   0.25%  llc  libLLVM-3.5.so       [.]
llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout co
+   0.25%  llc  libLLVM-3.5.so       [.]
llvm::IntegerType::get(llvm::LLVMContext&, unsigned int)
+   0.25%  llc  libc-2.19.so         [.] __memcmp_sse2
+   0.24%  llc  libLLVM-3.5.so       [.] llvm::hashing::detail::hash_short(char
const*, unsigned lon
+   0.24%  llc  libLLVM-3.5.so       [.] llvm::Use::getUser() const
+   0.23%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::Verifier::visitInstruction(llvm::Ins
+   0.23%  llc  libLLVM-3.5.so       [.]
llvm::FoldingSetNodeID::AddInteger(unsigned int)
+   0.22%  llc  libLLVM-3.5.so       [.]
llvm::LiveVariables::VarInfo::findKill(llvm::MachineBasicBl
+   0.21%  llc  libLLVM-3.5.so       [.]
llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunc
+   0.21%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::MachineCopyPropagation::runOnMachine
+   0.21%  llc  libLLVM-3.5.so       [.]
llvm::MachineOperand::isIdenticalTo(llvm::MachineOperand co
+   0.21%  llc  libLLVM-3.5.so       [.]
llvm::SelectionDAG::computeKnownBits(llvm::SDValue, llvm::A
+   0.20%  llc  libLLVM-3.5.so       [.]
llvm::MachineInstr::isSafeToMove(llvm::TargetInstrInfo cons
+   0.20%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::VirtRegRewriter::runOnMachineFunctio
+   0.20%  llc  libLLVM-3.5.so       [.]
llvm::FoldingSetImpl::FindNodeOrInsertPos(llvm::FoldingSetN
+   0.20%  llc  libLLVM-3.5.so       [.] (anonymous
namespace)::CodeGenPrepare::OptimizeInst(llvm::I
+   0.20%  llc  libLLVM-3.5.so       [.]
llvm::SelectionDAG::MorphNodeTo(llvm::SDNode*, unsigned int
+   0.19%  llc  libLLVM-3.5.so       [.]
llvm::LiveRangeCalc::extendToUses(llvm::LiveRange&, unsigne
+   0.19%  llc  libLLVM-3.5.so       [.] llvm::DAGTypeLegalizer::run()
+   0.19%  llc  libLLVM-3.5.so       [.]
llvm::X86InstrInfo::isUnpredicatedTerminator(llvm::MachineI
+   0.18%  llc  libLLVM-3.5.so       [.] bool
llvm::DenseMapBase<llvm::DenseMap<llvm::DenseMapAPIntK
+   0.18%  llc  libLLVM-3.5.so       [.]
llvm::LiveRange::extendInBlock(llvm::SlotIndex, llvm::SlotI
+   0.18%  llc  libLLVM-3.5.so       [.]
llvm::formatted_raw_ostream::PadToColumn(unsigned int)
+   0.17%  llc  libLLVM-3.5.so       [.]
llvm::FoldingSetNodeID::AddPointer(void const*)
+   0.16%  llc  libLLVM-3.5.so       [.]
llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMConte
+   0.16%  llc  libc-2.19.so         [.] __memcpy_sse2_unaligned

In both cases, the command to profile is:

perf record -g /tools/llvm/3.5.1rel/bin/llc -O2  LMtb_ethernet.ll -o opt.o

I am not using ASan or UBSan.
Quuxplusone commented 9 years ago

(In reply to comment #8)

With debug metadata:

  • 23.07% llc libLLVM-3.5.so [.] llvm::SlotTracker::CreateMetadataSlot(llvm::MDNode const*)
  • 22.87% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
  • 9.80% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*, unsigned int, llvm::Dens
  • 4.81% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::MDNode const*, unsigned int, llvm::Den

SlotTracker is only used in AsmWriter, so something is writing LLVM assembly somewhere. If you add -debug, this typically happens, and will be very slow.

You should set a breakpoint on llvm::SlotTracker::processModule() and send a backtrace. If you're not specifying -debug on the command-line, nothing should be calling the LLVM assembly writer.

  • 3.10% llc libLLVM-3.5.so [.] llvm::MDNode::getOperand(unsigned int) const

This is kind of fishy. This shouldn't really be that slow. Although maybe it used to be (this is 3.5.1, right?).

Without debug metadata:

  • 29.77% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
  • 16.39% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*, unsigned int, llvm::Dens
  • 4.23% llc libLLVM-3.5.so [.] llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
  • 3.65% llc libLLVM-3.5.so [.]

This just looks like the above with the metadata parts no longer showing up. Same problem: someone is writing LLVM assembly where they shouldn't.

Quuxplusone commented 9 years ago
(In reply to comment #3)
> Duncan recently fixed a metadata reading performance issue.

BTW, that issue never made it into a release (I introduced the problem *after*
the 3.5 branch).
Quuxplusone commented 9 years ago

(In reply to comment #9)

SlotTracker is only used in AsmWriter, so something is writing LLVM assembly somewhere. If you add -debug, this typically happens, and will be very slow.

You should set a breakpoint on llvm::SlotTracker::processModule() and send a backtrace. If you're not specifying -debug on the command-line, nothing should be calling the LLVM assembly writer.

BTW, this kind of pathological performance could be a single call to Value::print() or Value::dump(), since that rolls up a SlotTracker for the entire module. Put that in a tight loop and explode. (-debug or -debug-only=<some-pass> reliably triggers this since the otherwise no-op DEBUG() calls contain a lot of dump() and print() calls.)

Quuxplusone commented 9 years ago
Hmm... the problem with opt was that I didn't provide '-filetype=obj'. With
that, the actual time matches what -time-passes prints: about 40 seconds.

This leaves me puzzled, as my compiler still takes 80 seconds to generate all
the code. As a test, I commented out the actual call into LLVM to optimize and
generate each object. With that out, the rest of my compiler (parsing, semantic
analysis, elaboration, passes prior to final code generation, code generation,
IR printing, etc.) takes 8 seconds.

I also timed llc on each object file generated by my compiler.  With the
exception of the bitcode I've attached to this bug, compile times for the
others were 1-2 seconds in total.

Here's a perf report of my compiler:

+   6.77%  lt-dsim  libLLVM-3.5.so         [.]
llvm::DominatorTreeBase<llvm::BasicBlock>::dominates(llvm::BasicBlock const*,
llvm::BasicBlock const*) con
+   4.48%  lt-dsim  libLLVM-3.5.so         [.] llvm::LiveRange::verify() const
+   2.33%  lt-dsim  libLLVM-3.5.so         [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segment*, llvm::SlotIndex)
+   1.94%  lt-dsim  libLLVM-3.5.so         [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) const
+   1.80%  lt-dsim  libLLVM-3.5.so         [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+   1.76%  lt-dsim  libLLVM-3.5.so         [.]
llvm::InstCombiner::DoOneIteration(llvm::Function&, unsigned int)
+   1.74%  lt-dsim  libLLVM-3.5.so         [.] llvm::sys::MemoryFence()
+   1.48%  lt-dsim  libLLVM-3.5.so         [.]
llvm::PassRegistry::getPassInfo(void const*) const
+   1.32%  lt-dsim  libLLVM-3.5.so         [.]
llvm::SmallVectorImpl<llvm::LiveRange::Segment>::insert(llvm::LiveRange::Segment*,
llvm::LiveRange::Segmen
+   1.31%  lt-dsim  libc-2.19.so           [.] _int_malloc
+   1.22%  lt-dsim  libLLVM-3.5.so         [.]
llvm::computeKnownBits(llvm::Value*, llvm::APInt&, llvm::APInt&,
llvm::DataLayout const*, unsigned int)
+   1.21%  lt-dsim  libLLVM-3.5.so         [.]
llvm::PMDataManager::findAnalysisPass(void const*, bool)
+   1.10%  lt-dsim  libLLVM-3.5.so         [.]
llvm::PMTopLevelManager::findAnalysisPass(void const*)
+   1.09%  lt-dsim  libLLVM-3.5.so         [.]
llvm::LiveRange::find(llvm::SlotIndex)
+   1.06%  lt-dsim  libLLVM-3.5.so         [.]
llvm::InstCombiner::SimplifyDemandedUseBits(llvm::Value*, llvm::APInt,
llvm::APInt&, llvm::APInt&, unsigne
+   0.94%  lt-dsim  libLLVM-3.5.so         [.] llvm::Use::getUser() const
+   0.92%  lt-dsim  [kernel.kallsyms]      [k] 0xffffffff811798ca
+   0.77%  lt-dsim  libc-2.19.so           [.] _int_free
+   0.70%  lt-dsim  libLLVM-3.5.so         [.]
llvm::LiveRange::extendInBlock(llvm::SlotIndex, llvm::SlotIndex)
+   0.70%  lt-dsim  libLLVM-3.5.so         [.]
llvm::PassRegistry::getPassRegistry()
+   0.66%  lt-dsim  libLLVM-3.5.so         [.]
llvm::SwitchInst::getSuccessor(unsigned int) const
+   0.65%  lt-dsim  libc-2.19.so           [.] memset
+   0.64%  lt-dsim  libLLVM-3.5.so         [.]
llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const
+   0.63%  lt-dsim  libLLVM-3.5.so         [.]
llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout const*, unsigned int)
+   0.60%  lt-dsim  libc-2.19.so           [.] malloc_consolidate
+   0.56%  lt-dsim  libLLVM-3.5.so         [.]
llvm::LiveRange::extendSegmentStartTo(llvm::LiveRange::Segment*,
llvm::SlotIndex)
+   0.53%  lt-dsim  libLLVM-3.5.so         [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*,
unsigned int)
+   0.52%  lt-dsim  libc-2.19.so           [.] malloc
Quuxplusone commented 9 years ago

(In reply to comment #12)

Hmm... the problem with opt was that I didn't provide '-filetype=obj'. With that, the actual time matches what -time-passes prints: about 40 seconds.

This leaves me puzzled, as my compiler still takes 80 seconds to generate all the code. As a test, I commented out the actual call into LLVM to optimize and generate each object. With that out, the rest of my compiler (parsing, semantic analysis, elaboration, passes prior to final code generation, code generation, IR printing, etc.) takes 8 seconds.

You should to figure out what's calling into AsmWriter and instantiating the SlotTracker, since (1) it's wrong and (2) it's dominating your profile.