Closed Quuxplusone closed 8 years ago
Attached big-global.cc
(1570 bytes, text/x-c++src): reproduction
As Chandler correctly explains, the problem is unrelated to r159191 -- it was
just triggered by it.
Here is a simpler repro which does not depend on r159191
% cat asan_glob.sh
#!/bin/bash
RANDOM=123 # seed
echo "int foo();"
echo "void bar(int *a) {"
for ((i = 0; i < $1; i++)); do
echo " a[$RANDOM] = foo();"
done
echo "}"
% for((i=1024; i <= 8192; i*=2)); do ./asan_glob.sh $i > a.c; echo $i; time
clang -O2 -faddress-sanitizer -c a.c ; done
1024
TIME: real: 2.381; user: 1.550; system: 0.050
2048
TIME: real: 4.622; user: 3.650; system: 0.100
4096
TIME: real: 17.108; user: 12.670; system: 0.320
8192
TIME: real: 78.303; user: 62.300; system: 0.480
28.47% [.] llvm::LiveInterval::extendIntervalEndTo(llvm::LiveRange*, llvm::SlotIndex)
6.88% [.] llvm::SlotIndex::operator<(llvm::SlotIndex) const
3.79% [.] findLocalKill(unsigned int, llvm::MachineBasicBlock*, llvm::MachineInstr*, llvm::MachineRegisterInfo*, llvm::DenseMap<llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm»
3.65% [.] llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&)
3.52% [.] llvm::ilist_sentinel_traits<llvm::SparseBitVectorElement<128u> >::ensureHead(llvm::SparseBitVectorElement<128u>*&)
3.40% [.] llvm::LiveInterval::overlapsFrom(llvm::LiveInterval const&, llvm::LiveRange const*) const
2.98% [.] llvm::SlotIndex::operator<=(llvm::SlotIndex) const
2.81% [.] llvm::Instruction::setParent(llvm::BasicBlock*)
2.57% [.] llvm::LiveInterval::addRangeFrom(llvm::LiveRange, llvm::LiveRange*)
2.38% [.] llvm::SparseBitVector<128u>::FindLowerBound(unsigned int)
2.38% [.] llvm::MachineRegisterInfo::defusechain_iterator<true, false, true>::operator++()
1.99% [.] llvm::LiveIntervalUnion::Query::collectInterferingVRegs(unsigned int)
1.50% [.] (anonymous namespace)::MachineBlockPlacement::selectBestCandidateBlock((anonymous namespace)::BlockChain&, llvm::SmallVectorImpl<llvm::MachineBasicBlock*>&, llvm::SmallPtrSet<l»
1.33% [.] llvm::LiveIntervals::getMBBFromIndex(llvm::SlotIndex) const
1.21% [.] void std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> >::_M_range_insert<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBloc»
1.20% [.] llvm::DenseMapBase<llvm::DenseMap<llvm::MachineBasicBlock*, (anonymous namespace)::BlockChain*, llvm::DenseMapInfo<llvm::MachineBasicBlock*> >, llvm::MachineBasicBlock*, (anony»
1.07% [.] llvm::SplitAnalysis::calcLiveBlockInfo()
1.02% [.] llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVariables::VarInfo&, llvm::MachineBasicB
This clearly points to http://llvm.org/bugs/show_bug.cgi?id=12652
Looking more...
(In reply to comment #2)
> As Chandler correctly explains, the problem is unrelated to r159191 -- it was
> just triggered by it.
>
> Here is a simpler repro which does not depend on r159191
This is reproducing a different performance problem AFAICT. I was seeing well
over 80% of the cycles spent in MachineCSE, not in the live interval splitting
code.
strangely, the profile on the original repro looks different:
20.37% clang clang [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*,
llvm::MachineInstr::MICheckType) const
13.13% clang clang [.] llvm::MachineOperand::getReg()
const
12.55% clang clang [.] bool
llvm::DenseMapBase<llvm::DenseMap<llvm::MachineInstr*,
llvm::ScopedHashTableVal<llvm::MachineInstr*, unsigned int>*,
llvm::MachineInstrExpression
4.24% clang clang [.]
llvm::Instruction::setParent(llvm::BasicBlock*)
reverting 159191 is easy.
but I'd still rather keep it (so that we see other problem, if any).
Meanwhile, since this is blocking you, we can ask asan to not instrument more
than N (=10K) instructions in any given BB.
This will temporary solve our current problem as well as probably others (e.g.
12652) and give us time to fix quadratic algorithms in codegen and maybe invent
codegen-friendly asan instrumentation.
WDYT?
(In reply to comment #5)
> reverting 159191 is easy.
> but I'd still rather keep it (so that we see other problem, if any).
> Meanwhile, since this is blocking you, we can ask asan to not instrument more
> than N (=10K) instructions in any given BB.
> This will temporary solve our current problem as well as probably others (e.g.
> 12652) and give us time to fix quadratic algorithms in codegen and maybe
invent
> codegen-friendly asan instrumentation.
>
> WDYT?
I think either reverting or turning off instrumentation are both extremely
temporary hacks. I wouldn't expect either to stay in tree more than a few days,
so i wouldn't worry too much about "see other problems" bit...
I plan to fix MachineCSE ASAP. That thing is just broken, and it must be fixed.
As soon as we do, we can re-instate the patch, or revert the hard limit....
That said, I don't really care what the fix is. ;] I want us to not have the
regression in the interim, and I want to fix these issues rather quickly as
this is only likely to become a bigger problem for us, not a smaller one.
Anyways, if you can get whatever workaround you want to use in place tonight,
that would be awesome. We can pick up with our release process tomorrow
morning. I'm out for the interim, and will be busy tomorrow, but plan to take a
machete to the dense map in MachineCSE once I get some spare minutes. ;]
Also, as I stare at the asan instrumentation, I'm getting ideas about fixes to
that....
r159344 adds a workaround: it adds a flag -mllvm -asan-max-ins-per-bb=N with
default N=10000.
It helps your minimized test, please check if it helps your original test.
The Chandler's idea how to make asan more cg-friendly:
Currently, given asan behaves likes this:
% cat store2.cc
void foo(int *a, int *b) {
*a = 0;
*b = 1;
}
% clang -O2 -S -o - -emit-llvm -faddress-sanitizer store2.cc
...
; <label>:19 ; preds = %14
call void @__asan_report_store4(i64 %0) noreturn nounwind
unreachable
...
; <label>:25 ; preds = %20
call void @__asan_report_store4(i64 %7) noreturn nounwind
unreachable
instead of generating multiple BBs with __asan_report_*,
generate just one BB per each __asan_report_* function and have one large PHI
node in each such BB.
I'll give it a try.
I forgot an important reason why asan should *not* merge different calls to
__asan_report_*
When reporting a bug, asan relies on debug info attached to a particular call.
If we merge the calls, the stack traces in the reports will have incorrect top
frames.
Attached x.ll.bz2
(397115 bytes, application/octet-stream): pre-expanded reproduction
You can get the old behavior by passing
-mllvm -asan-max-ins-per-bb=99999999
(copied from llvm-commits thread)
I've committed (r160361) the changes that merge asan crash callbacks (under a
flag).
% cat ~/tmp/store2.cc
void foo(int *a, int *b) {
*a = 0;
*b = 1;
}
% clang -O2 -faddress-sanitizer -S -o - -mllvm -asan-merge-callbacks=1
~/tmp/store2.cc -emit-llvm 2>&1
...
crash_bb-w-4: ; preds = %16, %21
%14 = phi i64 [ %7, %21 ], [ %0, %16 ]
%15 = phi i64 [ 846930886, %21 ], [ 1804289383, %16 ]
call void @__asan_report_store4(i64 %14, i64 %15) noreturn nounwind
unreachable
...
This produces 5% bigger code:
1.08 400.perlbench
1.08 401.bzip2
1.07 403.gcc
1.00 429.mcf
1.01 445.gobmk
1.08 456.hmmer
1.03 458.sjeng
1.02 462.libquantum
1.08 464.h264ref
1.05 473.astar
1.07 483.xalancbmk
1.04 433.milc
1.08 444.namd
1.07 447.dealII
1.08 450.soplex
1.07 453.povray
1.00 470.lbm
1.06 482.sphinx3
This also produces 1%-4% slower code (I've run only 3 benchmarks)
400.perlbench, 1061.00, 1072.00, 1.01
401.bzip2, 910.00, 923.00, 1.01
403.gcc, 623.00, 645.00, 1.04
Besides, the reproducer from http://llvm.org/bugs/show_bug.cgi?id=13225 builds
3x slower!
I guess that these extra PHI nodes produce much more stress for CG than the
extra BBs.
% time clang -O2 -c -w -mllvm -asan-merge-callbacks=0 -mllvm -asan-max-ins-
per-bb=99999999 -faddress-sanitizer big-global.cc
TIME: real: 8.970; user: 8.690; system: 0.100
% time clang -O2 -c -w -mllvm -asan-merge-callbacks=1 -mllvm -asan-max-ins-
per-bb=99999999 -faddress-sanitizer big-global.cc
TIME: real: 25.629; user: 25.350; system: 0.080
%
The current implementation passes a random constant instead of the PC to the
callback.
Getting the real PC might end up even more expensive.
We can not get the PC on x86_32 at all, so we will need non-trivial changes in
the run-time to accept the module offset.
Finally, this adds 150 lines to the asan compiler module.
So, I am tempted to rollback all these changes.
WDYT?
r161757 removed --asan-merge-callbacks
Seems to be fixed long ago.
big-global.cc
(1570 bytes, text/x-c++src)x.ll.bz2
(397115 bytes, application/octet-stream)