WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.44k stars 737 forks source link

MergeBlocks pass very slow on emscripten wasm3.test_biggerswitch test #3090

Closed kripken closed 4 years ago

kripken commented 4 years ago

That pass takes over 6 minutes (!) on that testcase, so something is going very wrong. Possibly quadratic behavior. The testcase has a huge nested stack of blocks (hence the test name).

MaxGraey commented 4 years ago

I guess it's O(N^5) in worst case: ControlFlowWalker > while > outer for > inner for -> BranchUtils::BranchSeeker::has (PostWalker)

MaxGraey commented 4 years ago

Btw tree traversing I gues could be improved in general by improving cache locality (via pre-sorting) and other technique described in this papper: https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1053&context=open_access_dissertations

tlively commented 4 years ago

I took a look at the MergeBlocks pass, but it was intimidatingly complicated. A version of this pass (that does much less stuff) in Poppy IR is extremely simple, so I wonder if rewriting this pass in terms of Poppy IR would make it simpler. The conversion overhead wouldn't make it any faster of course, but if it were simpler, perhaps it would be easier to fix the performance problem.

kripken commented 4 years ago

I think we do need it in binaryen IR, since it helps other binaryen IR optimizations (so we can't let it run at the end with poppy IR). But yeah, this type of block operation may be simpler there...

Looking into this, @MaxGraey you are right that it's the call to BranchUtils::BranchSeeker::has that leads to O(N^2) behavior (not worse - there are more loops, but if N is the total number of nodes, they are distributed amongs the loops etc.).

The naive fix would be to compute the results for BranchUtils::BranchSeeker::has up front once in linear time, and have a map of expression => all the branches it has. This however requires a massive amount of memory because the pattern in that test is a huge tower of nested blocks with a switch at the top. The switch has lots of targets, so the map entries are very large. Testing this it takes more memory than my laptop has (!) so it's a bad idea. I'm looking into something more subtle.

tlively commented 4 years ago

I think we do need it in binaryen IR, since it helps other binaryen IR optimizations (so we can't let it run at the end with poppy IR). But yeah, this type of block operation may be simpler there...

I would not want to limit this to run at the end. Instead, I would have it stackify and unstackify as subpasses or I would depend on the pass runner to insert those conversions automatically.

Looking into this, @MaxGraey you are right that it's the call to BranchUtils::BranchSeeker::has that leads to O(N^2) behavior (not worse - there are more loops, but if N is the total number of nodes, they are distributed amongs the loops etc.).

This is also what leads to quadratic behavior in the validator. Would it work to integrate the BranchSeeker processing into a Walker that MergeBlocks and Validation can use to get constant-time seeking for each visited block?

kripken commented 4 years ago

@tlively

I would not want to limit this to run at the end.

Just to make sure we're on the same page, I think running at the end is undeniably a good idea, and the others need to be evaluated. For running at the start (during load), we need to check it's not significantly slower. For running in the middle (interleaved with other passes) I think this adds complexity (better to focus on a single IR for almost all passes, and to avoid IR conversion costs) and I'm unsure we'll have passes that actually need it - but as discussed earlier I am open to data showing otherwise.

This is also what leads to quadratic behavior in the validator. Would it work to integrate the BranchSeeker processing into a Walker that MergeBlocks and Validation can use to get constant-time seeking for each visited block?

Yeah, that's the direction I'm looking at. Good point that this could help the validator too, I'll make sure to include that.

kripken commented 4 years ago

@tlively I'm not sure I see where the validator has quadratic behavior here? It doesn't use BranchUtils or have any walks that look suspicious, and visitBlock looks linear to me. But maybe I'm missing something?

tlively commented 4 years ago

Oh sorry, I got confused between the branch seeker and the refinalizer, which has similar quadratic behavior :(

kripken commented 4 years ago

Hmm, reading the ReFinalizer I also don't see quadratic behavior? It looks like it tracks breaks incrementally as it walks, and visitBlock is linear AFAICT.

However it does look like ReFinalizeNode finalizes a single node in a way that would be quadratic if it were called constantly as part of a walk. I don't think we do that in any important code paths, though. (AutoDrop is the only risky one I see, but that's not normally run.)

tlively commented 4 years ago

I was thinking of the ReFinalizeNode usage in BinaryenIRValidator::visitExpression. On every block, it has the potential to do a full traversal of the subtree via the TypeSeeker in Block::finalize(). I just figured out recently that this code path is only taken if pass debugging in on, though, so that probably doesn't qualify as an important code path.

MaxGraey commented 4 years ago

We find out some issue during build self-hosted assemblyscript. It seems PR #3102 may cause to assert in wasm-traversal.h

kripken commented 4 years ago

@MaxGraey Getting a standalone testcase showing the issue would be good. You should be able to get the IR right before MergeBlocks runs and then run the reducer on that.

MaxGraey commented 4 years ago

I guess we could comment part of passes before first merge-blocks and use wasm-reduce afterwards for reduce module to minimum

tlively commented 4 years ago

@MaxGraey if you run with BINARYEN_PASS_DEBUG=3, you will get files for the state prior to each pass. So you shouldn't have to comment anything out or make any changes to get that intermediate state.

MaxGraey commented 4 years ago

Hopefully @dcodeIO figured out the problem. See https://github.com/emscripten-core/emscripten/issues/12123