Quuxplusone / LLVMBugzillaTest


Building apitrace with clang 3.4 takes 40+ minutes #19498

Closed Quuxplusone closed 10 years ago

Quuxplusone commented 10 years ago
Bugzilla Link PR19499
Status RESOLVED FIXED
Importance P normal
Reported by Jesper Dam (llvm@jalf.dk)
Reported on 2014-04-21 10:32:23 -0700
Last modified on 2014-04-28 05:41:19 -0700
Version trunk
Hardware Macintosh MacOS X
CC chandlerc@gmail.com, llvm-bugs@lists.llvm.org, rafael@espindo.la
Fixed by commit(s)
Attachments glstate_params.cpp.gz (383837 bytes, application/x-gzip)
Blocks
Blocked by
See also
Downloaded https://github.com/apitrace/apitrace from GitHub and tried to build it,
following the instructions from their INSTALL file, using clang 3.4 (from Xcode
5.1) on my MacBook Pro.

apitrace HEAD is 8d6c36d74ba9aefbc8c3618fc93dd4907a0dbf5e

> cmake -H. -DCMAKE_BUILD_TYPE=RelWithDebInfo
> make -C build

After 40 or so minutes, it was still chewing on the file glretrace_gl.cpp.

Attaching lldb, I got the following stacktrace:

  * frame #0: 0x0000000100735056 clang`llvm::LazyValueInfo::eraseBlock(llvm::BasicBlock*) + 918
    frame #1: 0x0000000100692044 clang`(anonymous namespace)::JumpThreading::runOnFunction(llvm::Function&) + 13364
    frame #2: 0x000000010031604d clang`llvm::FPPassManager::runOnFunction(llvm::Function&) + 333
    frame #3: 0x0000000100319215 clang`(anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) + 1669
    frame #4: 0x0000000100316799 clang`llvm::PassManager::run(llvm::Module&) + 1017
    frame #5: 0x00000001002e6103 clang`clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::Module*, clang::BackendAction, llvm::raw_ostream*) + 12227
    frame #6: 0x00000001002c1f09 clang`clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) + 2169
    frame #7: 0x000000010006ca34 clang`clang::ParseAST(clang::Sema&, bool, bool) + 11316
    frame #8: 0x0000000100066e04 clang`clang::CodeGenAction::ExecuteAction() + 68
    frame #9: 0x0000000100062550 clang`clang::FrontendAction::Execute() + 112
    frame #10: 0x000000010002aa2f clang`clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) + 1855
    frame #11: 0x0000000100006b54 clang`main + 17572

(I gave up after this, so I can't say how long it would have taken for the
build to actually complete)

Configuring cmake with -DCMAKE_BUILD_TYPE=Debug (which disables optimizations)
makes the build complete in a minute or so.
Quuxplusone commented 10 years ago

Can you provide a preprocessed source file that takes too long?

Quuxplusone commented 10 years ago

Attached glstate_params.cpp.gz (383837 bytes, application/x-gzip): Preprocessed source. Build with -O2 -c to repro

Quuxplusone commented 10 years ago
The IR in https://drive.google.com/file/d/0B7iRtublysV6YV9MWGY0bExOWkk/edit?usp=sharing causes -loop-simplify alone to take more than 5 minutes.
Quuxplusone commented 10 years ago
(In reply to comment #3)
> The IR in
> https://drive.google.com/file/d/0B7iRtublysV6YV9MWGY0bExOWkk/edit?usp=sharing
> causes -loop-simplify alone to take more than 5 minutes.

There are also some interesting pass interactions in here. With -loop-rotate
-jump-threading, jump threading takes 747.5470 seconds, but with just
-jump-threading it takes 5.0183 seconds.
Quuxplusone commented 10 years ago

So, the primary cause of slowness (at least for -loop-simplify, and I suspect everything downstream is related) is a function mapping an enum to a string. It contains a switch over 3378 integer values, each case going to an empty basic block, and all of those blocks branch to the same basic block, which then has a 3378-entry PHI node. This function then gets inlined over 100 times into the body of some other function.

None of this is actually bad, because after the optimizer chews through it the code becomes quite simple, and re-running the optimizers is fast. When inlining, we should be early-pruning the dead code here to limit the explosion of the caller's CFG.

Looking into why this isn't happening.
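
For illustration, here is a minimal C++ sketch of the shape of code being described; the names and values are made up for the example, not taken from the apitrace-generated sources:

    #include <cstdio>

    // Illustrative sketch only, not the actual apitrace-generated code.
    // An enum-to-string mapping written as one huge switch: in IR, each case
    // becomes an empty basic block branching to a common return block, which
    // ends up with one PHI entry per case.
    static const char *enumToString(unsigned value) {
        switch (value) {
        case 0x0000: return "GL_POINTS";
        case 0x0001: return "GL_LINES";
        case 0x0002: return "GL_LINE_LOOP";
        // ... the real generated function has roughly 3378 such cases ...
        default:     return "(unknown)";
        }
    }

    // The generated caller has 100+ distinct call sites like these; once
    // enumToString is inlined at every one, the caller's CFG explodes.
    void dumpTwoParams(unsigned a, unsigned b) {
        std::printf("%s %s\n", enumToString(a), enumToString(b));
    }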

Quuxplusone commented 10 years ago
After more investigation, I think there are probably a few separate bugs here...

First, how we get into this mess: we consider a 'switch' to have constant cost
regardless of how many destinations it has. As a consequence, a switch over
3378 values, with 3378 empty basic blocks, followed by one PHI node, is
accounted as having a cost of 2: 1 for the switch, and 1 for the PHI node.
This may seem surprising, but with the new switch-lowering magic, we actually
*lower* switches to a constant number of instructions in many cases. But
regardless of how we lower the switch, the result is that functions with 100s
of calls to a function containing a single such switch can *explode* in the
number of basic blocks. This is somewhere between questionable and bad. Perhaps
if we had a 'select'-like instruction to model this without CFG edges it would
be cheaper, but I'm not 100% certain.

There also appear to be N^2 algorithms in the *handling* of these control flows
in at least jump-threading and loop-simplify. Regardless of the former, these
should be fixed.

I'm thinking about instituting a linear cost for switch instructions relative
to how many targets they have, or alternatively a linear cost for a branch
instruction relative to how many predecessors the destination has. Not 100%
sure what the best approach is here, but we clearly need something.

We used not to see this because we considered even an unconditional branch to
have a cost, which in effect made the cost linear in the size of the switch.
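
As a rough, simplified sketch of the accounting difference being discussed (this is not LLVM's actual inline-cost code, just an illustration of flat versus per-case switch cost):

    // Simplified illustration of the two cost policies described above; the
    // constants and structure are made up, not LLVM's real inline cost model.
    struct SwitchCost {
        static const int InstrCost = 1;  // illustrative unit cost

        // Current behavior (as described): a switch plus its PHI cost 2,
        // no matter how many cases the switch has.
        static int flat(unsigned /*numCases*/) { return 2 * InstrCost; }

        // Proposed direction: charge per case (or per predecessor of the
        // destination block), so a 3378-case switch is never considered
        // cheap enough to inline.
        static int linear(unsigned numCases) {
            return InstrCost + static_cast<int>(numCases) * InstrCost;
        }
    };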
Quuxplusone commented 10 years ago

OK, definitely two bugs here.

The first is inlining the huge switch CFGs. I verified that we are now reliably forming lookup tables early in the IR when we can do so reasonably. Given this, there is no reason to allow inlining huge switches. I have a patch to fix this in the inliner and will commit it shortly. That drops the compile time by over 5x for me.

There is still clearly a bug after that with loop simplify. I'll look into that next.
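
At the source level, the lookup-table form mentioned above is roughly equivalent to the following; this is illustrative only, since the real transform is done on IR by SimplifyCFG, and the names here are made up:

    // Illustrative, source-level equivalent of the switch-to-lookup-table
    // conversion; names and contents are made up for the example.
    static const char *const kNames[] = {
        "GL_POINTS", "GL_LINES", "GL_LINE_LOOP", "GL_LINE_STRIP",
        // ... one entry per dense case value ...
    };

    const char *enumToStringTable(unsigned value) {
        // One bounds check plus one load, with no per-case basic blocks or
        // PHI entries, so inlining it no longer explodes the caller's CFG.
        if (value < sizeof(kNames) / sizeof(kNames[0]))
            return kNames[value];
        return "(unknown)";
    }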

Quuxplusone commented 10 years ago

r207403 should significantly improve this; it addresses the inliner problems that were exploding the CFG of one function.

There is still the problem with loop-simplify to be diagnosed.

Quuxplusone commented 10 years ago
r207406 addresses the primary problem with loop-simplify, which was actually a
*really* bad performance problem with the domtree code. The fix is quite simple
and reduces the compile time by over 70% for me.

There is still one more performance bug I can see though.
Quuxplusone commented 10 years ago

And the last obvious and easy-to-fix quadratic behavior is fixed in r207409. The code is still quite slow to compile (25 seconds in opt for me), but overwhelmingly this is due to slowness in LVI and jump threading that is reasonably well known and long-standing. I don't have a bug handy for that, but I don't know that this bug is tracking anything interesting at this point.

Thanks for the original report! Lots of great fixes stemming from that!