Closed Quuxplusone closed 10 years ago
Can you provide a preprocessed source that takes too long?
Attached glstate_params.cpp.gz
(383837 bytes, application/x-gzip): Preprocessed source. Build with -O2 -c to repro
The IR in
https://drive.google.com/file/d/0B7iRtublysV6YV9MWGY0bExOWkk/edit?usp=sharing
causes -loop-simplify alone to take more than 5 minutes.
(In reply to comment #3)
> The IR in
> https://drive.google.com/file/d/0B7iRtublysV6YV9MWGY0bExOWkk/edit?usp=sharing
>
>
> causes -loop-simplify alone to take more than 5 minutes.
There is also some interesting pass interactions in here. With -loop-rotate -
jump-threading, jump threading takes 747.5470 seconds, but with just -jump-
threading it takes 5.0183 seconds.
so, the primary cause of slowness (at least for -loop-simplify, and i suspect everything downstream is related) is a function mapping an enum to a string which contains a switch over 3378 integer values, all of them to empty basic blocks which all branch to the same basic block that then has a 3378 value PHI node. This function then gets inlined over 100 times into the body of some other function.
none of this is actually bad, because after the optimizer chews through this, the code becomes quite simple and re-running the optimizers is fast. when inlining, we should be early-pruning the dead code here to limit the explosion of the CFG of the function.
Looking into why this isn't happening.
after more investigation, i think there are probably a few separate bugs here...
first, how we get into this mess: we consider the cost of a 'switch' to have
constant time regardless of how many destinations it has. As a consequence, a
switch over 3378 values, with 3378 empty basic blocks, followed by one phi
node, is accounted as have a cost of 2: 1 for the switch, and 1 for the phi
node. This may seem surprising, but with the new switch lowering magic, we
actually *lower* switches to a constant number of instructions in many cases.
But regardless of how we lower the switch, the result is that functions with
100s of calls to a function with a single such switch can *explode* in the
number of basic blocks. This is somewhere between questionable and bad. Perhaps
if we had a 'select' like instruction to model this without CFG edges it would
be cheaper, but I'm not 100% certain.
There also appear to be N^2 algorithms in the *handling* of these control flows
in at least jump-threading and loop-simplify. Regardless of the former, these
should be fixed.
I'm thinking about instituting a linear cost for switch instructions relative
to how many targets they have, or alternatively a linear cost for a branch
instruction relative to how many predecessors the destination has. Not 100%
sure what the best approach is here, but we clearly need something.
We used to not see this because we considered even an unconditional branch to
have cost, which in effect gave linear cost on the size of the switch.
OK, definitely two bugs here.
The first is inlining the huge switch CFGs. I verified that we are now reliably forming lookup tables early in the IR when we can do so reasonably. Given this, there is no reason to allow inlining huge switches. I have a patch to fix this in the inliner and will commit it shortly. That drops the compile time by over 5x for me.
There is still clearly a bug after that with loop simplify. I'll look into that next.
r207403 should significantly improve this, it addresses the inliner problems that were exploding the CFG of one function.
There is still the problem with loop-simplify to be diagnosed.
r207406 addresses the primary problem with loop-simplify, which was actually a
*really* bad performance problem with the domtree code. the fix is quite simple
and reduces the compile time by over 70% for me.
There is still one more performance bug I can see though.
And the last obvious and easy to fix quadratic behavior is fixed in r207409. The code is still quite slow to compile (25 seconds in opt for me) but overwhelmingly this is due to slowness in LVI and jump threading that are reasonably well known and lang standing. I don't have a bug handy, but I don't know that this bug is tracking anything interesting at this point.
Thanks for the original report! Lots of great fixes stemming from that!
glstate_params.cpp.gz
(383837 bytes, application/x-gzip)