Open Quuxplusone opened 9 years ago
Attached timing
(13218 bytes, application/octet-stream): Timing report
Hmm... it appears that the bitcode is too large to attach here. Even assembled to bitcode and gzipped, it's 4.6MB.
Is there a way I could share it, e.g. via DropBox? I don't have a convenient web server to post it to.
DropBox would be fine.
Duncan recently fixed a metadata reading performance issue.
Metadata isn't the culprit here.
I locally regenerated the IR without any debug metadata (I can just flip a switch in the compiler.)
It saves about 10 seconds off the 1:40 run time.
What is the time of
opt -disable-output -disable-verify test.bc
?
About one second.
BTW, bitcode parsing is not the issue either. "In-system", the IR is built in memory using the C++ API, and optimization/code generation is done from that.
With debug metadata:
+ 23.07% llc libLLVM-3.5.so [.]
llvm::SlotTracker::CreateMetadataSlot(llvm::MDNode const*)
+ 22.87% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
+ 9.80% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*,
unsigned int, llvm::Dens
+ 4.81% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::MDNode const*,
unsigned int, llvm::Den
+ 3.10% llc libLLVM-3.5.so [.] llvm::MDNode::getOperand(unsigned int)
const
+ 2.97% llc libLLVM-3.5.so [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
+ 2.09% llc libLLVM-3.5.so [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) con
+ 0.95% llc libLLVM-3.5.so [.] _ZNK4llvm6MDNode10getOperandEj@plt
+ 0.87% llc libc-2.19.so [.] _int_malloc
+ 0.86% llc libLLVM-3.5.so [.]
llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVari
+ 0.79% llc [kernel.kallsyms] [k] 0xffffffff8104f45a
+ 0.66% llc libLLVM-3.5.so [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+ 0.63% llc libLLVM-3.5.so [.] llvm::AttributeSet::getFnAttributes()
const
+ 0.61% llc libLLVM-3.5.so [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::MDNode const*, unsi
+ 0.58% llc libLLVM-3.5.so [.] llvm::LLLexer::LexIdentifier()
+ 0.51% llc libLLVM-3.5.so [.]
llvm::LiveRange::createDeadDef(llvm::SlotIndex, llvm::BumpP
+ 0.46% llc libc-2.19.so [.] _int_free
+ 0.41% llc libLLVM-3.5.so [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segmen
+ 0.41% llc libLLVM-3.5.so [.] llvm::LLLexer::LexToken()
+ 0.40% llc libLLVM-3.5.so [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, uns
+ 0.32% llc libc-2.19.so [.] malloc_consolidate
+ 0.24% llc libLLVM-3.5.so [.]
llvm::SmallPtrSetImplBase::erase_imp(void const*)
+ 0.24% llc libc-2.19.so [.] memset
+ 0.23% llc libLLVM-3.5.so [.] (anonymous
namespace)::DAGCombiner::visit(llvm::SDNode*)
+ 0.22% llc libLLVM-3.5.so [.] (anonymous
namespace)::JoinVals::computeAssignment(unsigned
+ 0.22% llc libLLVM-3.5.so [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*
+ 0.21% llc libLLVM-3.5.so [.]
llvm::X86InstrInfo::AnalyzeBranch(llvm::MachineBasicBlock&,
+ 0.21% llc libc-2.19.so [.] malloc
+ 0.20% llc libLLVM-3.5.so [.]
llvm::TargetLowering::SimplifyDemandedBits(llvm::SDValue, l
Without debug metadata:
+ 29.77% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
+ 16.39% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*,
unsigned int, llvm::Dens
+ 4.23% llc libLLVM-3.5.so [.]
llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
+ 3.65% llc libLLVM-3.5.so [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) con
+ 1.35% llc libLLVM-3.5.so [.]
llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVari
+ 1.15% llc libLLVM-3.5.so [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+ 1.11% llc libc-2.19.so [.] _int_malloc
+ 0.96% llc libLLVM-3.5.so [.] llvm::LLLexer::LexIdentifier()
+ 0.90% llc libLLVM-3.5.so [.]
llvm::LiveRange::createDeadDef(llvm::SlotIndex, llvm::BumpP
+ 0.84% llc [kernel.kallsyms] [k] 0xffffffff8104f45a
+ 0.81% llc libLLVM-3.5.so [.] llvm::AttributeSet::getFnAttributes()
const
+ 0.73% llc libLLVM-3.5.so [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, uns
+ 0.72% llc libLLVM-3.5.so [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segmen
+ 0.60% llc libc-2.19.so [.] _int_free
+ 0.56% llc libc-2.19.so [.] malloc_consolidate
+ 0.52% llc libLLVM-3.5.so [.] llvm::LLLexer::LexToken()
+ 0.40% llc libLLVM-3.5.so [.] (anonymous
namespace)::DAGCombiner::visit(llvm::SDNode*)
+ 0.39% llc libLLVM-3.5.so [.]
llvm::X86InstrInfo::AnalyzeBranch(llvm::MachineBasicBlock&,
+ 0.38% llc libLLVM-3.5.so [.]
llvm::SmallPtrSetImplBase::erase_imp(void const*)
+ 0.38% llc libLLVM-3.5.so [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*
+ 0.38% llc libLLVM-3.5.so [.]
llvm::TargetLowering::SimplifyDemandedBits(llvm::SDValue, l
+ 0.37% llc libLLVM-3.5.so [.] (anonymous
namespace)::JoinVals::computeAssignment(unsigned
+ 0.36% llc libc-2.19.so [.] memset
+ 0.31% llc libLLVM-3.5.so [.]
llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm
+ 0.30% llc libLLVM-3.5.so [.]
llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4
+ 0.29% llc libc-2.19.so [.] malloc
+ 0.29% llc libLLVM-3.5.so [.] llvm::sys::MemoryFence()
+ 0.29% llc libLLVM-3.5.so [.]
llvm::MachineInstr::addRegisterKilled(unsigned int, llvm::T
+ 0.28% llc libLLVM-3.5.so [.] (anonymous
namespace)::DAGCombiner::visitSTORE(llvm::SDNode
+ 0.26% llc libLLVM-3.5.so [.] llvm::LiveRange::find(llvm::SlotIndex)
+ 0.25% llc libLLVM-3.5.so [.]
llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout co
+ 0.25% llc libLLVM-3.5.so [.]
llvm::IntegerType::get(llvm::LLVMContext&, unsigned int)
+ 0.25% llc libc-2.19.so [.] __memcmp_sse2
+ 0.24% llc libLLVM-3.5.so [.] llvm::hashing::detail::hash_short(char
const*, unsigned lon
+ 0.24% llc libLLVM-3.5.so [.] llvm::Use::getUser() const
+ 0.23% llc libLLVM-3.5.so [.] (anonymous
namespace)::Verifier::visitInstruction(llvm::Ins
+ 0.23% llc libLLVM-3.5.so [.]
llvm::FoldingSetNodeID::AddInteger(unsigned int)
+ 0.22% llc libLLVM-3.5.so [.]
llvm::LiveVariables::VarInfo::findKill(llvm::MachineBasicBl
+ 0.21% llc libLLVM-3.5.so [.]
llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunc
+ 0.21% llc libLLVM-3.5.so [.] (anonymous
namespace)::MachineCopyPropagation::runOnMachine
+ 0.21% llc libLLVM-3.5.so [.]
llvm::MachineOperand::isIdenticalTo(llvm::MachineOperand co
+ 0.21% llc libLLVM-3.5.so [.]
llvm::SelectionDAG::computeKnownBits(llvm::SDValue, llvm::A
+ 0.20% llc libLLVM-3.5.so [.]
llvm::MachineInstr::isSafeToMove(llvm::TargetInstrInfo cons
+ 0.20% llc libLLVM-3.5.so [.] (anonymous
namespace)::VirtRegRewriter::runOnMachineFunctio
+ 0.20% llc libLLVM-3.5.so [.]
llvm::FoldingSetImpl::FindNodeOrInsertPos(llvm::FoldingSetN
+ 0.20% llc libLLVM-3.5.so [.] (anonymous
namespace)::CodeGenPrepare::OptimizeInst(llvm::I
+ 0.20% llc libLLVM-3.5.so [.]
llvm::SelectionDAG::MorphNodeTo(llvm::SDNode*, unsigned int
+ 0.19% llc libLLVM-3.5.so [.]
llvm::LiveRangeCalc::extendToUses(llvm::LiveRange&, unsigne
+ 0.19% llc libLLVM-3.5.so [.] llvm::DAGTypeLegalizer::run()
+ 0.19% llc libLLVM-3.5.so [.]
llvm::X86InstrInfo::isUnpredicatedTerminator(llvm::MachineI
+ 0.18% llc libLLVM-3.5.so [.] bool
llvm::DenseMapBase<llvm::DenseMap<llvm::DenseMapAPIntK
+ 0.18% llc libLLVM-3.5.so [.]
llvm::LiveRange::extendInBlock(llvm::SlotIndex, llvm::SlotI
+ 0.18% llc libLLVM-3.5.so [.]
llvm::formatted_raw_ostream::PadToColumn(unsigned int)
+ 0.17% llc libLLVM-3.5.so [.]
llvm::FoldingSetNodeID::AddPointer(void const*)
+ 0.16% llc libLLVM-3.5.so [.]
llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMConte
+ 0.16% llc libc-2.19.so [.] __memcpy_sse2_unaligned
In both cases, the command to profile is:
perf record -g /tools/llvm/3.5.1rel/bin/llc -O2 LMtb_ethernet.ll -o opt.o
I am not using ASan or UBSan.
(In reply to comment #8)
With debug metadata:
- 23.07% llc libLLVM-3.5.so [.] llvm::SlotTracker::CreateMetadataSlot(llvm::MDNode const*)
- 22.87% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
- 9.80% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*, unsigned int, llvm::Dens
- 4.81% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::MDNode const*, unsigned int, llvm::Den
SlotTracker is only used in AsmWriter
, so something is writing LLVM assembly somewhere. If you add -debug
, this typically happens, and will be very slow.
You should set a breakpoint on llvm::SlotTracker::processModule()
and send a backtrace. If you're not specifying -debug
on the command-line, nothing should be calling the LLVM assembly writer.
- 3.10% llc libLLVM-3.5.so [.] llvm::MDNode::getOperand(unsigned int) const
This is kind of fishy. This shouldn't really be that slow. Although maybe it used to be (this is 3.5.1, right?).
Without debug metadata:
- 29.77% llc libLLVM-3.5.so [.] llvm::SlotTracker::processModule()
- 16.39% llc libLLVM-3.5.so [.] llvm::DenseMap<llvm::Value const*, unsigned int, llvm::Dens
- 4.23% llc libLLVM-3.5.so [.] llvm::DenseMapBase<llvm::DenseMap<llvm::Value const*, unsig
- 3.65% llc libLLVM-3.5.so [.]
This just looks like the above with the metadata parts no longer showing up. Same problem: someone is writing LLVM assembly where they shouldn't.
(In reply to comment #3)
> Duncan recently fixed a metadata reading performance issue.
BTW, that issue never made it into a release (I introduced the problem *after*
the 3.5 branch).
(In reply to comment #9)
SlotTracker is only used in
AsmWriter
, so something is writing LLVM assembly somewhere. If you add-debug
, this typically happens, and will be very slow.You should set a breakpoint on
llvm::SlotTracker::processModule()
and send a backtrace. If you're not specifying-debug
on the command-line, nothing should be calling the LLVM assembly writer.
BTW, this kind of pathological performance could be a single call to Value::print()
or Value::dump()
, since that rolls up a SlotTracker
for the entire module. Put that in a tight loop and explode. (-debug
or -debug-only=<some-pass>
reliably triggers this since the otherwise no-op DEBUG()
calls contain a lot of dump()
and print()
calls.)
Hmm... the problem with opt was that I didn't provide '-filetype=obj'. With
that, the actual time matches what -time-passes prints: about 40 seconds.
This leaves me puzzled, as my compiler still takes 80 seconds to generate all
the code. As a test, I commented out the actual call into LLVM to optimize and
generate each object. With that out, the rest of my compiler (parsing, semantic
analysis, elaboration, passes prior to final code generation, code generation,
IR printing, etc.) takes 8 seconds.
I also timed llc on each object file generated by my compiler. With the
exception of the bitcode I've attached to this bug, compile times for the
others were 1-2 seconds in total.
Here's a perf report of my compiler:
+ 6.77% lt-dsim libLLVM-3.5.so [.]
llvm::DominatorTreeBase<llvm::BasicBlock>::dominates(llvm::BasicBlock const*,
llvm::BasicBlock const*) con
+ 4.48% lt-dsim libLLVM-3.5.so [.] llvm::LiveRange::verify() const
+ 2.33% lt-dsim libLLVM-3.5.so [.]
llvm::LiveRange::extendSegmentEndTo(llvm::LiveRange::Segment*, llvm::SlotIndex)
+ 1.94% lt-dsim libLLVM-3.5.so [.]
llvm::MachineRegisterInfo::clearKillFlags(unsigned int) const
+ 1.80% lt-dsim libLLVM-3.5.so [.]
llvm::SmallPtrSetImplBase::insert_imp(void const*)
+ 1.76% lt-dsim libLLVM-3.5.so [.]
llvm::InstCombiner::DoOneIteration(llvm::Function&, unsigned int)
+ 1.74% lt-dsim libLLVM-3.5.so [.] llvm::sys::MemoryFence()
+ 1.48% lt-dsim libLLVM-3.5.so [.]
llvm::PassRegistry::getPassInfo(void const*) const
+ 1.32% lt-dsim libLLVM-3.5.so [.]
llvm::SmallVectorImpl<llvm::LiveRange::Segment>::insert(llvm::LiveRange::Segment*,
llvm::LiveRange::Segmen
+ 1.31% lt-dsim libc-2.19.so [.] _int_malloc
+ 1.22% lt-dsim libLLVM-3.5.so [.]
llvm::computeKnownBits(llvm::Value*, llvm::APInt&, llvm::APInt&,
llvm::DataLayout const*, unsigned int)
+ 1.21% lt-dsim libLLVM-3.5.so [.]
llvm::PMDataManager::findAnalysisPass(void const*, bool)
+ 1.10% lt-dsim libLLVM-3.5.so [.]
llvm::PMTopLevelManager::findAnalysisPass(void const*)
+ 1.09% lt-dsim libLLVM-3.5.so [.]
llvm::LiveRange::find(llvm::SlotIndex)
+ 1.06% lt-dsim libLLVM-3.5.so [.]
llvm::InstCombiner::SimplifyDemandedUseBits(llvm::Value*, llvm::APInt,
llvm::APInt&, llvm::APInt&, unsigne
+ 0.94% lt-dsim libLLVM-3.5.so [.] llvm::Use::getUser() const
+ 0.92% lt-dsim [kernel.kallsyms] [k] 0xffffffff811798ca
+ 0.77% lt-dsim libc-2.19.so [.] _int_free
+ 0.70% lt-dsim libLLVM-3.5.so [.]
llvm::LiveRange::extendInBlock(llvm::SlotIndex, llvm::SlotIndex)
+ 0.70% lt-dsim libLLVM-3.5.so [.]
llvm::PassRegistry::getPassRegistry()
+ 0.66% lt-dsim libLLVM-3.5.so [.]
llvm::SwitchInst::getSuccessor(unsigned int) const
+ 0.65% lt-dsim libc-2.19.so [.] memset
+ 0.64% lt-dsim libLLVM-3.5.so [.]
llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const
+ 0.63% lt-dsim libLLVM-3.5.so [.]
llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout const*, unsigned int)
+ 0.60% lt-dsim libc-2.19.so [.] malloc_consolidate
+ 0.56% lt-dsim libLLVM-3.5.so [.]
llvm::LiveRange::extendSegmentStartTo(llvm::LiveRange::Segment*,
llvm::SlotIndex)
+ 0.53% lt-dsim libLLVM-3.5.so [.]
llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*,
unsigned int)
+ 0.52% lt-dsim libc-2.19.so [.] malloc
(In reply to comment #12)
Hmm... the problem with opt was that I didn't provide '-filetype=obj'. With that, the actual time matches what -time-passes prints: about 40 seconds.
This leaves me puzzled, as my compiler still takes 80 seconds to generate all the code. As a test, I commented out the actual call into LLVM to optimize and generate each object. With that out, the rest of my compiler (parsing, semantic analysis, elaboration, passes prior to final code generation, code generation, IR printing, etc.) takes 8 seconds.
You should to figure out what's calling into AsmWriter
and instantiating the SlotTracker
, since (1) it's wrong and (2) it's dominating your profile.
timing
(13218 bytes, application/octet-stream)