Closed kripken closed 4 years ago
I guess it's O(N^5) in the worst case:

```
ControlFlowWalker
  while
    outer for
      inner for
        -> BranchUtils::BranchSeeker::has (PostWalker)
```
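To illustrate the shape of the problem (this is a standalone Python sketch, not binaryen code, and the names are made up): when an outer traversal visits every node and each visit triggers a full subtree scan like `BranchSeeker::has`, the total work on a deeply nested chain of blocks is 1 + 2 + ... + N, i.e. quadratic even before any extra loop nesting.

```python
# Hypothetical tree: a linked chain of nested nodes, mimicking the
# worst-case input (a tall tower of blocks, each with one child).
class Node:
    def __init__(self, child=None):
        self.child = child

def build_chain(n):
    node = None
    for _ in range(n):
        node = Node(node)
    return node

def subtree_size(node):
    # Stands in for a BranchSeeker-style query: a full O(subtree) scan.
    size = 0
    while node is not None:
        size += 1
        node = node.child
    return size

def scans_performed(root):
    # Outer walk visits every node; each visit re-scans its subtree,
    # so total work is N + (N-1) + ... + 1 = O(N^2).
    total = 0
    node = root
    while node is not None:
        total += subtree_size(node)
        node = node.child
    return total
```

For a chain of 100 nodes this performs 5050 node visits rather than 100, which matches the quadratic blowup described above.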
Btw, tree traversal could in general be improved by improving cache locality (via pre-sorting) and other techniques described in this paper: https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1053&context=open_access_dissertations
I took a look at the MergeBlocks pass, but it was intimidatingly complicated. A version of this pass (that does much less stuff) in Poppy IR is extremely simple, so I wonder if rewriting this pass in terms of Poppy IR would make it simpler. The conversion overhead wouldn't make it any faster of course, but if it were simpler, perhaps it would be easier to fix the performance problem.
I think we do need it in binaryen IR, since it helps other binaryen IR optimizations (so we can't let it run at the end with poppy IR). But yeah, this type of block operation may be simpler there...
Looking into this, @MaxGraey you are right that it's the call to `BranchUtils::BranchSeeker::has` that leads to O(N^2) behavior (not worse - there are more loops, but if N is the total number of nodes, they are distributed among the loops etc.).
The naive fix would be to compute the results for `BranchUtils::BranchSeeker::has` up front once in linear time, and keep a map of expression => all the branches it has. However, this requires a massive amount of memory, because the pattern in that test is a huge tower of nested blocks with a switch at the top. The switch has lots of targets, so the map entries are very large. Testing this, it takes more memory than my laptop has (!), so it's a bad idea. I'm looking into something more subtle.
> I think we do need it in binaryen IR, since it helps other binaryen IR optimizations (so we can't let it run at the end with poppy IR). But yeah, this type of block operation may be simpler there...
I would not want to limit this to run at the end. Instead, I would have it stackify and unstackify as subpasses or I would depend on the pass runner to insert those conversions automatically.
> Looking into this, @MaxGraey you are right that it's the call to `BranchUtils::BranchSeeker::has` that leads to O(N^2) behavior (not worse - there are more loops, but if N is the total number of nodes, they are distributed among the loops etc.).
This is also what leads to quadratic behavior in the validator. Would it work to integrate the `BranchSeeker` processing into a `Walker` that MergeBlocks and validation can use to get constant-time seeking for each visited block?
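A sketch of that idea in standalone Python (again illustrative, not binaryen's `Walker` API): a post-order walk maintains a running count of branches per target label as it goes. Since a wasm branch to a block's label can only occur inside that block, by the time the walk visits the block post-order, every branch to it has already been counted, so the "is my label branched to?" query is O(1) and the whole pass stays linear.

```python
from collections import Counter

def walk_with_branch_counts(node, visit_block, counts=None):
    """Post-order walk over (kind, targets, children) nodes.

    counts maps target label -> number of branches to it seen so far
    in the walk, so visit_block can query a label in O(1) instead of
    re-scanning the block's subtree.
    """
    if counts is None:
        counts = Counter()
    kind, targets, children = node
    for child in children:
        walk_with_branch_counts(child, visit_block, counts)
    for t in targets:
        counts[t] += 1
    if kind == "block":
        visit_block(node, counts)
    return counts
```

For example, on a tower of blocks over a switch targeting label `L0`, every block's visitor sees the final count for `L0` without any extra traversal.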
@tlively

> I would not want to limit this to run at the end.
Just to make sure we're on the same page, I think running at the end is undeniably a good idea, and the others need to be evaluated. For running at the start (during load), we need to check it's not significantly slower. For running in the middle (interleaved with other passes) I think this adds complexity (better to focus on a single IR for almost all passes, and to avoid IR conversion costs) and I'm unsure we'll have passes that actually need it - but as discussed earlier I am open to data showing otherwise.
> This is also what leads to quadratic behavior in the validator. Would it work to integrate the `BranchSeeker` processing into a `Walker` that MergeBlocks and validation can use to get constant-time seeking for each visited block?
Yeah, that's the direction I'm looking at. Good point that this could help the validator too, I'll make sure to include that.
@tlively I'm not sure I see where the validator has quadratic behavior here? It doesn't use BranchUtils or have any walks that look suspicious, and `visitBlock` looks linear to me. But maybe I'm missing something?
Oh sorry, I got confused between the branch seeker and the refinalizer, which has similar quadratic behavior :(
Hmm, reading the `ReFinalizer` I also don't see quadratic behavior? It looks like it tracks breaks incrementally as it walks, and `visitBlock` is linear AFAICT.
However, it does look like `ReFinalizeNode` finalizes a single node in a way that would be quadratic if it were called constantly as part of a walk. I don't think we do that in any important code paths, though. (AutoDrop is the only risky one I see, but that's not normally run.)
I was thinking of the `ReFinalizeNode` usage in `BinaryenIRValidator::visitExpression`. On every block, it has the potential to do a full traversal of the subtree via the `TypeSeeker` in `Block::finalize()`. I just figured out recently that this code path is only taken if pass debugging is on, though, so that probably doesn't qualify as an important code path.
We found an issue while building self-hosted AssemblyScript. It seems PR #3102 may cause an assert in `wasm-traversal.h`.
@MaxGraey Getting a standalone testcase showing the issue would be good. You should be able to get the IR right before MergeBlocks runs and then run the reducer on that.
I guess we could comment out some of the passes before the first merge-blocks and then use wasm-reduce to reduce the module to a minimum.
@MaxGraey if you run with `BINARYEN_PASS_DEBUG=3`, you will get files for the state prior to each pass. So you shouldn't have to comment anything out or make any changes to get that intermediate state.
Hopefully @dcodeIO figured out the problem. See https://github.com/emscripten-core/emscripten/issues/12123
That pass takes over 6 minutes (!) on that testcase, so something is going very wrong. Possibly quadratic behavior. The testcase has a huge nested stack of blocks (hence the test name).