WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.5k stars 742 forks source link

Speed up LocalGraph #1338

Open kripken opened 6 years ago

kripken commented 6 years ago

src/ir/local-graph.h is fairly slow, and we should make it faster. If it were fast enough, we could enable precompute-propagate and merge-locals in more optimization levels, which would improve code output quality.

What LocalGraph does is scan a function and produce a mapping of the sets affecting each get. That is, for each get, the sets that reach that get, that their assigned value may be read there.

This is similar to SSA computation in a way, but simpler (no need to create phis), and there are fairly fast SSA computations on structured control flow like ours, so it seems like this could be fairly fast. Right now the implementation naively carries around big data structures needlessly.

kripken commented 6 years ago

One case we should test on is binaryen3.test_sqlite in emscripten: the merge-locals pass is extremely slow there (dozens of seconds).

kripken commented 6 years ago

I experimented with using a flow analysis, similar to coalesce-locals, in https://github.com/WebAssembly/binaryen/commits/fast-local-graph (see last 3 commits there) but it makes it slower :-o

The issue there is probably that we move around maps of index => set of relevant gets. A small_set class could help a lot there, perhaps.

froydnj commented 6 years ago

On my local Linux machine, python tests/runner.py binaryen3.test_sqlite takes about 25-30s, depending, before erroring out with:

test_sqlite (test_core.binaryen3) ... (checking sanity from test runner)
INFO:root:(Emscripten: Running sanity checks)
<skipping: No JS engine present to run this test with. Check /home/froydnj/.emscripten and the paths therein.>  ok

----------------------------------------------------------------------
Ran 1 test in 25.794s

OK

which I think means that it ran all the optimizations appropriate to this bug.

Inspecting the process tree with tracetree says that about half of that time goes to running opt and about half that time to asm2wasm. I assume the asm2wasm part is the part we're concerned about here.

Looking at a perf profile is frustrating, as I don't think the bits in the SDK are compiled with appropriate options to enable cheap backtraces. A flat profile for the above command says ~20% of the time is doing malloc/free. LocalGraph::visitGetLocal and SimplifyLocals::visitPost show up taking ~3% of the time each--and that's over the entire run, so roughly 6% each during asm2wasm (!).

Doing a callgraph profile says that we're spending a lot of time manipulating std::_Rb_tree<SetLocal*, SetLocal*, ...> instances; I think these correspond to manipulating LocalGraph::getSetses, LocalGraph::mappingStack, or LocalGraph::breakMappings. The breakMappings.erase() call in LocalGraph::visitBlock is responsible for a lot of free traffic, for instance.

So a smaller representation for LocalGraph::Sets would be a good starting point.

froydnj commented 6 years ago

Doing a EM_SAVE_DIR=1 run unfortunately doesn't seem to save the .asm.js files (?); at least, my /tmp/emscripten_temp is full of tmp${garbage}.jsfunc_0.js.jo.js.jo.js.jso.js files, and nothing looking like an .asm.js file, so it's a bit hard to run the asm2wasm command on its own...

kripken commented 6 years ago

No JS engine present to run this test with.

Hmm, looks like node and sm bugs limit what vms we run the test on. And testing, it seems like those issues are still active today, so we should leave them...

For your local testing, if you have spidermonkey installed locally, you can add it to ~/.emscripten which will run the code in at least the optimized modes of that test (some other tests benefit from it too).

which I think means that it ran all the optimizations appropriate to this bug.

You can verify with EMCC_DEBUG=1 in the env. That will log times for each step, and you should see an asm2wasm step which has the binaryen optimizations. It will also save a .temp.asm.js file - sorry for the confusion about what saves what, perhaps that's somewhat we could improve. (Overall, EM_SAVE_DIR=1 should keep the temp dir alive, but EMCC_DEBUG=1 tells it to emit all temp files.)

Inspecting the process tree with tracetree says that about half of that time goes to running opt and about half that time to asm2wasm.

That's without #1382, I assume? It should improve things substantially at least on sqlite.

So a smaller representation for LocalGraph::Sets would be a good starting point.

Very interesting analysis, thanks. I think most of it remains valid after #1382 since that core data structure is the same. It may make sense to use an unordered set or a sorted vector or something else, but those also have downsides. Probably we should remeasure after that PR lands.

kripken commented 6 years ago

Oh, and thanks for mentioning tracetree! First I heard of it, looks very useful.