Open Quuxplusone opened 4 years ago
Is it enough to set "--x86-experimental-lvi-inline-asm-hardening", and "target-feature=+lvi-cfi,+lvi-load-hardening" to get the LVI pass to run?
For Rust, I'm not 100% certain, as I have not personally compiled any rust code with the LVI hardening.
For clang/clang++, it suffices to compile with the -mlvi-hardening
flag.
I'm running llc on ScalarEvolution.bc (a leftover file I had from some earlier
investigation). It takes about 30 seconds to process this file, and the
oprofile results are:
Using /w/test/oprofile_data/samples/ for samples directory.
CPU: Intel Skylake microarchitecture, speed 3000 MHz (estimated)
Counted cpu_clk_unhalted events () with a unit mask of 0x00 (Core cycles when
at least one thread on the physical core is not in halt state) count 100000
samples % image name symbol name
54378 5.9208 llc unsigned int
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::runDFS<false,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::verifyParentPro
perty(llvm::DominatorTreeBase<llvm::MachineBasicBlock, false>
const&)::{lambda(llvm::MachineBasicBlock*,
llvm::MachineBasicBlock*)#1}>(llvm::MachineBasicBlock*, unsigned int,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::verifyParentProperty(llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> const&)::{lambda(llvm::MachineBasicBlock*, llvm::MachineBasicBlock*)#1},
unsigned int)
47477 5.1694 llc (anonymous
namespace)::MachineVerifier::visitMachineFunctionAfter()
39995 4.3548 llc unsigned int
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::runDFS<false,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::verifyParentProperty(llvm::DominatorTreeBase<llvm::BasicBlock, false>
const&)::{lambda(llvm::BasicBlock*, llvm::BasicBlock*)#1}>(llvm::BasicBlock*,
unsigned int,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::verifyParentProperty(llvm::DominatorTreeBase<llvm::BasicBlock, false>
const&)::{lambda(llvm::BasicBlock*, llvm::BasicBlock*)#1}, unsigned int)
29406 3.2018 llc
llvm::DenseMapBase<llvm::DenseMap<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::MachineBasicBlock*>,
llvm::detail::DenseMapPair<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec> >, llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::MachineBasicBlock*>,
llvm::detail::DenseMapPair<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec> >::FindAndConstruct(llvm::MachineBasicBlock* const&)
27112 2.9520 libc-2.27.so _int_malloc
26622 2.8987 llc
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::ChildrenGetter<false>::Get(llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::BatchUpdateInfo*)
24294 2.6452 llc
llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>,
llvm::detail::DenseMapPair<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec> >, llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>,
llvm::detail::DenseMapPair<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec> >::FindAndConstruct(llvm::BasicBlock* const&)
24169 2.6316 llc bool (anonymous
namespace)::VRegFilter::filterAndAdd<llvm::DenseSet<unsigned int,
llvm::DenseMapInfo<unsigned int> > >(llvm::DenseSet<unsigned int,
llvm::DenseMapInfo<unsigned int> > const&, llvm::SmallVectorImpl<unsigned int>&)
23714 2.5821 llc checkForCyclesHelper(llvm::SDNode
const*, llvm::SmallPtrSetImpl<llvm::SDNode const*>&,
llvm::SmallPtrSetImpl<llvm::SDNode const*>&, llvm::SelectionDAG const*)
23583 2.5678 llc (anonymous
namespace)::MachineVerifier::verify(llvm::MachineFunction&)
23307 2.5377 libc-2.27.so free
22229 2.4204 llc
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::ChildrenGetter<false>::Get(llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::BatchUpdateInfo*)
22059 2.4019 llc llvm::SmallPtrSetImpl<llvm::SDNode
const*>::insert(llvm::SDNode const*)
21991 2.3944 llc
llvm::Instruction::getNumSuccessors() const
19755 2.1510 llc (anonymous
namespace)::MachineVerifier::visitMachineOperand(llvm::MachineOperand const*,
unsigned int)
18117 1.9726 llc
llvm::Instruction::getSuccessor(unsigned int) const
15364 1.6729 llc (anonymous
namespace)::MachineVerifier::visitMachineBundleAfter(llvm::MachineInstr const*)
15150 1.6496 llc (anonymous
namespace)::details::StructuralHash::update(llvm::Function&)
14762 1.6073 libc-2.27.so malloc
13687 1.4903 llc void
llvm::SmallVectorImpl<llvm::BasicBlock*>::append<std::__1::reverse_iterator<llvm::SuccIterator<llvm::Instruction,
llvm::BasicBlock> >,
void>(std::__1::reverse_iterator<llvm::SuccIterator<llvm::Instruction,
llvm::BasicBlock> >,
std::__1::reverse_iterator<llvm::SuccIterator<llvm::Instruction,
llvm::BasicBlock> >)
11663 1.2699 llc llvm::detail::DenseSetPair<unsigned
int>* llvm::DenseMapBase<llvm::DenseMap<unsigned int,
llvm::detail::DenseSetEmpty, llvm::DenseMapInfo<unsigned int>,
llvm::detail::DenseSetPair<unsigned int> >, unsigned int,
llvm::detail::DenseSetEmpty, llvm::DenseMapInfo<unsigned int>,
llvm::detail::DenseSetPair<unsigned int> >::InsertIntoBucketImpl<unsigned
int>(unsigned int const&, unsigned int const&,
llvm::detail::DenseSetPair<unsigned int>*)
11226 1.2223 llc
llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>,
llvm::detail::DenseMapPair<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec> >, llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>,
llvm::detail::DenseMapPair<llvm::BasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock,
false> >::InfoRec> >::clear()
10043 1.0935 llc
llvm::DenseMapBase<llvm::DenseMap<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::MachineBasicBlock*>,
llvm::detail::DenseMapPair<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec> >, llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec, llvm::DenseMapInfo<llvm::MachineBasicBlock*>,
llvm::detail::DenseMapPair<llvm::MachineBasicBlock*,
llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::MachineBasicBlock,
false> >::InfoRec> >::clear()
8552 0.9312 llc llvm::BasicBlock::getTerminator()
const
7664 0.8345 llc
llvm::rdf::PhysicalRegisterInfo::getAliasSet(unsigned int) const
[...]
The first RDF-related symbol is at the bottom with 0.8% of the entire execution
time (this is not the entire list). Do you have a file that you could share
where the RDF liveness is taking a significant time? I have some improvements
that should reduce the complexity of the ordering stage, but I'm not seeing any
difference with the files I tried.
Attached bigfunc.ll.zip
(371766 bytes, application/x-zip-compressed): Big function compiled from Rust source
Attached bigfunc-lvi.ll.zip
(371806 bytes, application/x-zip-compressed): Big function compiled from Rust source with LVI mitigations
Perf report when using llc to compile bigfunc-lvi.ll:
22.16% llc llc [.] std::_Rb_tree<std::pair<unsigned int, llvm::LaneBitmask>, std::pair<unsigned int, llvm::LaneBitmask>, std::_Identity<std::pair<unsigned int, llvm::LaneBitmask> >, std::less<std::pair<unsigned int,
20.15% llc llc [.] llvm::rdf::Liveness::computePhiInfo
9.43% llc llc [.] llvm::rdf::RegisterAggr::makeRegRef
7.18% llc llc [.] llvm::rdf::RegisterAggr::makeRegRef() const::$_0::operator()
5.47% llc libc-2.31.so [.] malloc
5.35% llc llc [.] llvm::BitVector::find_first_in
5.35% llc libc-2.31.so [.] _int_free
3.97% llc libstdc++.so.6.0.28 [.] std::_Rb_tree_increment
3.84% llc llc [.] llvm::rdf::RegisterAggr::insert
3.77% llc llc [.] llvm::rdf::RegisterAggr::clearIn
1.66% llc libc-2.31.so [.] cfree@GLIBC_2.2.5
1.21% llc libc-2.31.so [.] __memset_avx2_unaligned_erms
0.74% llc libstdc++.so.6.0.28 [.] std::_Rb_tree_decrement
0.39% llc llc [.] memset@plt
0.34% llc llc [.] free@plt
0.34% llc llc [.] malloc@plt
0.33% llc llc [.] llvm::rdf::PhysicalRegisterInfo::getAliasSet
0.31% llc llc [.] llvm::LiveRange::verify
0.27% llc libc-2.31.so [.] _int_malloc
0.17% llc libstdc++.so.6.0.28 [.] std::_Rb_tree_insert_and_rebalance
0.13% llc llc [.] std::_Rb_tree_increment@plt
0.13% llc llc [.] llvm::rdf::DataFlowGraph::releaseBlock
0.13% llc llc [.] llvm::SelectionDAG::Combine
0.12% llc llc [.] std::__detail::_Map_base<unsigned int, std::pair<unsigned int const, llvm::rdf::DataFlowGraph::DefStack>, std::allocator<std::pair<unsigned int const, llvm::rdf::DataFlowGraph::DefStack> >, std::_
0.10% llc llc [.] llvm::sys::unicode::columnWidthUTF8
Note that I am also using the patch to fix the performance issue in the LVI mitigations: https://reviews.llvm.org/D84471. So the remaining overhead appears to be caused by RDF.
I'm surprised by the execution times that you're getting. There is indeed a
compile-time increase related to RDF, but in my experiments it's not 16-fold
(1.2s -> 20s), but in the order of 15%.
On my machine:
--- master
$ time llc bigfunc.ll
real 1m28.141s
user 1m21.665s
sys 0m6.464s
$ time llc bigfunc-lvi.ll
real 2m9.032s
user 2m2.475s
sys 0m6.549s
--- master + D84471
$ time llc bigfunc.ll
real 1m27.341s
user 1m20.780s
sys 0m6.556s
$ time llc bigfunc-lvi.ll
real 1m51.290s
user 1m45.120s
sys 0m6.168s
--- master + D84471 + RDF wip fixes
$ time llc bigfunc.ll
real 1m27.275s
user 1m20.714s
sys 0m6.548s
$ time llc bigfunc-lvi.ll
real 1m40.499s
user 1m34.199s
sys 0m6.292s
(In reply to Krzysztof Parzyszek from comment #8)
> I'm surprised by the execution times that you're getting. There is indeed a
> compile-time increase related to RDF, but in my experiments it's not 16-fold
> (1.2s -> 20s), but in the order of 15%.
>
> On my machine:
>
> --- master
>
> $ time llc bigfunc.ll
> real 1m28.141s
> user 1m21.665s
> sys 0m6.464s
>
> $ time llc bigfunc-lvi.ll
> real 2m9.032s
> user 2m2.475s
> sys 0m6.549s
>
> --- master + D84471
>
> $ time llc bigfunc.ll
> real 1m27.341s
> user 1m20.780s
> sys 0m6.556s
>
> $ time llc bigfunc-lvi.ll
> real 1m51.290s
> user 1m45.120s
> sys 0m6.168s
>
> --- master + D84471 + RDF wip fixes
>
> $ time llc bigfunc.ll
> real 1m27.275s
> user 1m20.714s
> sys 0m6.548s
>
> $ time llc bigfunc-lvi.ll
> real 1m40.499s
> user 1m34.199s
> sys 0m6.292s
I am even more surprised, since the original bug reporter
(https://github.com/rust-lang/rust/issues/74632) recently re-ran his own tests
and observed a ~24-fold increase with D84471, following a 180-fold increase
without D84471. I am at a total loss to explain how you are observing such a
small slowdown. What is the exact LLVM commit that you're building on, so we
can compare apples-to-apples?
I may have expensive checks enabled in my clang build. Let me rebuild everything and try again.
That turned out to be the case... Revised times:
--- master
$ time llc bigfunc.ll
real 0m1.832s
user 0m1.784s
sys 0m0.049s
$ time llc bigfunc-lvi.ll
real 0m41.508s
user 0m41.406s
sys 0m0.101s
--- master + D84471
$ time llc bigfunc.ll
real 0m1.811s
user 0m1.759s
sys 0m0.053s
$ time llc bigfunc-lvi.ll
real 0m27.714s
user 0m27.645s
sys 0m0.069s
--- master + D84471 + RDF wip fixes
$ time llc bigfunc.ll
real 0m1.802s
user 0m1.765s
sys 0m0.037s
$ time llc bigfunc-lvi.ll
real 0m15.475s
user 0m15.398s
sys 0m0.077s
I committed a handful of patches that get the ~50% reduction in compile-time.
Any further improvements would likely require changing the algorithm, but I
don't have that much time at the moment (such a change would need to be
thoroughly tested).
Now I get
$ time llc bigfunc-lvi.ll
real 0m12.970s
user 0m12.884s
sys 0m0.077s
The patches
https://reviews.llvm.org/rG09897b146a8a
https://reviews.llvm.org/rG47fe1b63f449
https://reviews.llvm.org/rGf0f467aeecfc
https://reviews.llvm.org/rG4b25f672998f
https://reviews.llvm.org/rG9521704553e8
Not all of these are strictly performance-related, but there may be
dependencies between them.
bigfunc.ll.zip
(371766 bytes, application/x-zip-compressed)bigfunc-lvi.ll.zip
(371806 bytes, application/x-zip-compressed)The RDF live variables analysis probably has high asymptotic complexity than what is actually necessary. In particular:
https://github.com/llvm/llvm-project/blob/b5be1c5419e2a38eb60fc7e785567b54b6d9e0e0/llvm/lib/CodeGen/RDFLiveness.cpp#L208
This may be causing substantial compilation overhead for some users who enable passes that depend on RDF live variable analysis, e.g., https://github.com/rust-lang/rust/issues/74632.
Related discussion on llvm-dev: http://lists.llvm.org/pipermail/llvm-dev/2020-April/141332.html.