AddressSanitizer produces intractibly slow compiles for global initializers

Quuxplusone commented 12 years ago


Bugzilla Link	PR13225
Status	RESOLVED FIXED
Importance	P enhancement
Reported by	Chandler Carruth (chandlerc@gmail.com)
Reported on	2012-06-28 03:36:13 -0700
Last modified on	2015-11-30 18:06:40 -0800
Version	unspecified
Hardware	Other All
CC	kcc@google.com, llvm-bugs@lists.llvm.org, matthewbg@google.com, nlewycky@google.com, samsonov@google.com
Fixed by commit(s)
Attachments	`big-global.cc` (1570 bytes, text/x-c++src) `x.ll.bz2` (397115 bytes, application/octet-stream)
Blocks
Blocked by
See also

After r159191 where Kostya enabled ASan instrumentation on global init
functions, an old failure mode for ASan has become critical: it creates too
many basic blocks.

Specifically, when instrumenting loads and stores, ASan creates new basic
blocks.

Very large, mechanically generated global initializers (not uncommon sadly,
more common than other large functions) produce *huge* global init functions.

When we enable ASan on the latter, it results in terrible slowdowns of the
entire optimizer chain.

There are several conspiring bugs here:

1) MachineCSE seems to have N^2 (at least) complexity w.r.t. these *particular*
basic block patterns. It's far and away the worst offender, and it clearly is
an algorithmic bug within that routine. With that fixed we get down to
extremely slow compiles rather than interminable compiles.

2) AddressSanitizer could almost certainly do something more clever to avoid
the basic block count it currently imposes.

3) MachineBlockPlacement and a couple of other codegen passes are also slower
than they "should be" for these inputs. They aren't the root of the problem,
but they probably deserve some attention even as the other two aspects are
fixed.

Until either #1 or #2 is fixed, I think r159191 has to be backed out, as this
is... very frustratingly common.

I've attached a preprocessor-based C program that can be used to stress test
these issues. Uncomment lines to watch the problem get worse. This could almost
certainly be simplified, but it gives very nice "real world" sets of loads and
stores which we can watch go throuh the optimization and codegen pipelines. The
loads and stores are essentially identical in this test case to real world code
we hit with this problem.

Quuxplusone commented 12 years ago

Attached big-global.cc (1570 bytes, text/x-c++src): reproduction

Quuxplusone commented 12 years ago

As Chandler correctly explains, the problem is unrelated to r159191 -- it was
just triggered by it.

Here is a simpler repro which does not depend on r159191

% cat asan_glob.sh
#!/bin/bash

RANDOM=123 # seed
echo "int foo();"
echo "void bar(int *a) {"
for ((i = 0; i < $1; i++)); do
  echo "  a[$RANDOM] = foo();"
done
echo "}"

% for((i=1024; i <= 8192; i*=2)); do ./asan_glob.sh $i  > a.c; echo $i; time
clang -O2 -faddress-sanitizer -c a.c ; done
1024
TIME: real: 2.381; user: 1.550; system: 0.050
2048
TIME: real: 4.622; user: 3.650; system: 0.100
4096
TIME: real: 17.108; user: 12.670; system: 0.320
8192
TIME: real: 78.303; user: 62.300; system: 0.480

    28.47%     [.] llvm::LiveInterval::extendIntervalEndTo(llvm::LiveRange*, llvm::SlotIndex)
     6.88%     [.] llvm::SlotIndex::operator<(llvm::SlotIndex) const
     3.79%     [.] findLocalKill(unsigned int, llvm::MachineBasicBlock*, llvm::MachineInstr*, llvm::MachineRegisterInfo*, llvm::DenseMap<llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm»
     3.65%     [.] llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&)
     3.52%     [.] llvm::ilist_sentinel_traits<llvm::SparseBitVectorElement<128u> >::ensureHead(llvm::SparseBitVectorElement<128u>*&)
     3.40%     [.] llvm::LiveInterval::overlapsFrom(llvm::LiveInterval const&, llvm::LiveRange const*) const
     2.98%     [.] llvm::SlotIndex::operator<=(llvm::SlotIndex) const
     2.81%     [.] llvm::Instruction::setParent(llvm::BasicBlock*)
     2.57%     [.] llvm::LiveInterval::addRangeFrom(llvm::LiveRange, llvm::LiveRange*)
     2.38%     [.] llvm::SparseBitVector<128u>::FindLowerBound(unsigned int)
     2.38%     [.] llvm::MachineRegisterInfo::defusechain_iterator<true, false, true>::operator++()
     1.99%     [.] llvm::LiveIntervalUnion::Query::collectInterferingVRegs(unsigned int)
     1.50%     [.] (anonymous namespace)::MachineBlockPlacement::selectBestCandidateBlock((anonymous namespace)::BlockChain&, llvm::SmallVectorImpl<llvm::MachineBasicBlock*>&, llvm::SmallPtrSet<l»
     1.33%     [.] llvm::LiveIntervals::getMBBFromIndex(llvm::SlotIndex) const
     1.21%     [.] void std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> >::_M_range_insert<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBloc»
     1.20%     [.] llvm::DenseMapBase<llvm::DenseMap<llvm::MachineBasicBlock*, (anonymous namespace)::BlockChain*, llvm::DenseMapInfo<llvm::MachineBasicBlock*> >, llvm::MachineBasicBlock*, (anony»
     1.07%     [.] llvm::SplitAnalysis::calcLiveBlockInfo()
     1.02%     [.] llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVariables::VarInfo&, llvm::MachineBasicB

This clearly points to http://llvm.org/bugs/show_bug.cgi?id=12652

Looking more...

Quuxplusone commented 12 years ago

(In reply to comment #2)
> As Chandler correctly explains, the problem is unrelated to r159191 -- it was
> just triggered by it.
>
> Here is a simpler repro which does not depend on r159191

This is reproducing a different performance problem AFAICT. I was seeing well
over 80% of the cycles spent in MachineCSE, not in the live interval splitting
code.

Quuxplusone commented 12 years ago

strangely, the profile on the original repro looks different:

    20.37%    clang  clang                [.]
llvm::MachineInstr::isIdenticalTo(llvm::MachineInstr const*,
llvm::MachineInstr::MICheckType) const
    13.13%    clang  clang                [.] llvm::MachineOperand::getReg()
const
    12.55%    clang  clang                [.] bool
llvm::DenseMapBase<llvm::DenseMap<llvm::MachineInstr*,
llvm::ScopedHashTableVal<llvm::MachineInstr*, unsigned int>*,
llvm::MachineInstrExpression
     4.24%    clang  clang                [.]
llvm::Instruction::setParent(llvm::BasicBlock*)

Quuxplusone commented 12 years ago

reverting 159191 is easy.
but I'd still rather keep it (so that we see other problem, if any).
Meanwhile, since this is blocking you, we can ask asan to not instrument more
than N (=10K) instructions in any given BB.
This will temporary solve our current problem as well as probably others (e.g.
12652) and give us time to fix quadratic algorithms in codegen and maybe invent
codegen-friendly asan instrumentation.

WDYT?

Quuxplusone commented 12 years ago

(In reply to comment #5)
> reverting 159191 is easy.
> but I'd still rather keep it (so that we see other problem, if any).
> Meanwhile, since this is blocking you, we can ask asan to not instrument more
> than N (=10K) instructions in any given BB.
> This will temporary solve our current problem as well as probably others (e.g.
> 12652) and give us time to fix quadratic algorithms in codegen and maybe
invent
> codegen-friendly asan instrumentation.
>
> WDYT?

I think either reverting or turning off instrumentation are both extremely
temporary hacks. I wouldn't expect either to stay in tree more than a few days,
so i wouldn't worry too much about "see other problems" bit...

I plan to fix MachineCSE ASAP. That thing is just broken, and it must be fixed.
As soon as we do, we can re-instate the patch, or revert the hard limit....

That said, I don't really care what the fix is. ;] I want us to not have the
regression in the interim, and I want to fix these issues rather quickly as
this is only likely to become a bigger problem for us, not a smaller one.

Anyways, if you can get whatever workaround you want to use in place tonight,
that would be awesome. We can pick up  with our release process tomorrow
morning. I'm out for the interim, and will be busy tomorrow, but plan to take a
machete to the dense map in MachineCSE once I get some spare minutes. ;]

Also, as I stare at the asan instrumentation, I'm getting ideas about fixes to
that....

Quuxplusone commented 12 years ago

r159344 adds a workaround: it adds a flag  -mllvm -asan-max-ins-per-bb=N with
default N=10000.
It helps your minimized test, please check if it helps your original test.

Quuxplusone commented 12 years ago

The Chandler's idea how to make asan more cg-friendly:

Currently, given asan behaves likes this:

% cat store2.cc
void foo(int *a, int *b) {
  *a = 0;
  *b = 1;
}

% clang -O2 -S -o - -emit-llvm -faddress-sanitizer store2.cc
...
; <label>:19                                      ; preds = %14
  call void @__asan_report_store4(i64 %0) noreturn nounwind
  unreachable
...
; <label>:25                                      ; preds = %20
  call void @__asan_report_store4(i64 %7) noreturn nounwind
  unreachable

instead of generating multiple BBs with __asan_report_*,
generate just one BB per each __asan_report_* function and have one large PHI
node in each such BB.
I'll give it a try.

Quuxplusone commented 12 years ago

I forgot an important reason why asan should *not* merge different calls to
__asan_report_*
When reporting a bug, asan relies on debug info attached to a particular call.
If we merge the calls, the stack traces in the reports will have incorrect top
frames.

Quuxplusone commented 12 years ago

Attached x.ll.bz2 (397115 bytes, application/octet-stream): pre-expanded reproduction

Quuxplusone commented 12 years ago

You can get the old behavior by passing
   -mllvm -asan-max-ins-per-bb=99999999

Quuxplusone commented 12 years ago

(copied from llvm-commits thread)

I've committed (r160361) the changes that merge asan crash callbacks (under a
flag).

% cat ~/tmp/store2.cc
void foo(int *a, int *b) {
  *a = 0;
  *b = 1;
}
% clang -O2 -faddress-sanitizer -S -o - -mllvm -asan-merge-callbacks=1
~/tmp/store2.cc -emit-llvm 2>&1
...
crash_bb-w-4:                                     ; preds = %16, %21
  %14 = phi i64 [ %7, %21 ], [ %0, %16 ]
  %15 = phi i64 [ 846930886, %21 ], [ 1804289383, %16 ]
  call void @__asan_report_store4(i64 %14, i64 %15) noreturn nounwind
  unreachable
...

This produces 5% bigger code:
1.08    400.perlbench
1.08    401.bzip2
1.07    403.gcc
1.00    429.mcf
1.01    445.gobmk
1.08    456.hmmer
1.03    458.sjeng
1.02    462.libquantum
1.08    464.h264ref
1.05    473.astar
1.07    483.xalancbmk
1.04    433.milc
1.08    444.namd
1.07    447.dealII
1.08    450.soplex
1.07    453.povray
1.00    470.lbm
1.06    482.sphinx3

This also produces 1%-4% slower code (I've run only 3 benchmarks)
       400.perlbench,      1061.00,      1072.00,         1.01
           401.bzip2,       910.00,       923.00,         1.01
             403.gcc,       623.00,       645.00,         1.04

Besides, the reproducer from http://llvm.org/bugs/show_bug.cgi?id=13225 builds
3x slower!
I guess that these extra PHI nodes produce much more stress for CG than the
extra BBs.
% time clang -O2 -c  -w -mllvm -asan-merge-callbacks=0  -mllvm -asan-max-ins-
per-bb=99999999   -faddress-sanitizer big-global.cc
TIME: real: 8.970; user: 8.690; system: 0.100
% time clang -O2 -c  -w -mllvm -asan-merge-callbacks=1  -mllvm -asan-max-ins-
per-bb=99999999   -faddress-sanitizer big-global.cc
TIME: real: 25.629; user: 25.350; system: 0.080
%

The current implementation passes a random constant instead of the PC to the
callback.
Getting the real PC might end up even more expensive.
We can not get the PC on x86_32 at all, so we will need non-trivial changes in
the run-time to accept the module offset.

Finally, this adds 150 lines to the asan compiler module.

So, I am tempted to rollback all these changes.
WDYT?

Quuxplusone commented 12 years ago

r161757 removed --asan-merge-callbacks

Quuxplusone commented 8 years ago

Seems to be fixed long ago.

Quuxplusone / LLVMBugzillaTest

AddressSanitizer produces intractibly slow compiles for global initializers #13303