kripken opened 6 years ago
One case we should test on is `binaryen3.test_sqlite` in emscripten: the `merge-locals` pass is extremely slow there (dozens of seconds).
I experimented with using a flow analysis, similar to `coalesce-locals`, in https://github.com/WebAssembly/binaryen/commits/fast-local-graph (see the last 3 commits there), but it makes things slower :-o
The issue there is probably that we move around maps of index => set of relevant gets. A `small_set` class could help a lot there, perhaps.
On my local Linux machine, `python tests/runner.py binaryen3.test_sqlite` takes about 25-30s, depending, before erroring out with:

```
test_sqlite (test_core.binaryen3) ... (checking sanity from test runner)
INFO:root:(Emscripten: Running sanity checks)
<skipping: No JS engine present to run this test with. Check /home/froydnj/.emscripten and the paths therein.> ok
----------------------------------------------------------------------
Ran 1 test in 25.794s

OK
```
which I think means that it ran all the optimizations appropriate to this bug.
Inspecting the process tree with tracetree says that about half of that time goes to running `opt` and about half of that time to `asm2wasm`. I assume the `asm2wasm` part is the part we're concerned about here.
Looking at a `perf` profile is frustrating, as I don't think the bits in the SDK are compiled with appropriate options to enable cheap backtraces. A flat profile for the above command says ~20% of the time is spent in `malloc`/`free`. `LocalGraph::visitGetLocal` and `SimplifyLocals::visitPost` show up taking ~3% of the time each--and that's over the entire run, so roughly 6% each during `asm2wasm` (!).
Doing a callgraph profile says that we're spending a lot of time manipulating `std::_Rb_tree<SetLocal*, SetLocal*, ...>` instances; I think these correspond to manipulating `LocalGraph::getSetses`, `LocalGraph::mappingStack`, or `LocalGraph::breakMappings`. The `breakMappings.erase()` call in `LocalGraph::visitBlock` is responsible for a lot of `free` traffic, for instance.
So a smaller representation for `LocalGraph::Sets` would be a good starting point.
Doing an `EM_SAVE_DIR=1` run unfortunately doesn't seem to save the `.asm.js` files (?); at least, my `/tmp/emscripten_temp` is full of `tmp${garbage}.jsfunc_0.js.jo.js.jo.js.jso.js` files, and nothing looking like an `.asm.js` file, so it's a bit hard to run the `asm2wasm` command on its own...
> No JS engine present to run this test with.
Hmm, it looks like node and sm bugs limit which VMs we run the test on. And testing, it seems those issues are still active today, so we should leave them...
For your local testing, if you have spidermonkey installed locally, you can add it to `~/.emscripten`, which will run the code in at least the optimized modes of that test (some other tests benefit from it too).
> which I think means that it ran all the optimizations appropriate to this bug.
You can verify with `EMCC_DEBUG=1` in the env. That will log times for each step, and you should see an asm2wasm step which has the binaryen optimizations. It will also save a `.temp.asm.js` file - sorry for the confusion about what saves what; perhaps that's something we could improve. (Overall, `EM_SAVE_DIR=1` should keep the temp dir alive, but `EMCC_DEBUG=1` tells it to emit all temp files.)
> Inspecting the process tree with tracetree says that about half of that time goes to running `opt` and about half of that time to `asm2wasm`.
That's without #1382, I assume? It should improve things substantially at least on sqlite.
> So a smaller representation for `LocalGraph::Sets` would be a good starting point.
Very interesting analysis, thanks. I think most of it remains valid after #1382 since that core data structure is the same. It may make sense to use an unordered set or a sorted vector or something else, but those also have downsides. Probably we should remeasure after that PR lands.
Oh, and thanks for mentioning `tracetree`! First I heard of it, looks very useful.
`src/ir/local-graph.h` is fairly slow, and we should make it faster. If it were fast enough, we could enable `precompute-propagate` and `merge-locals` in more optimization levels, which would improve code output quality.

What LocalGraph does is scan a function and produce a mapping of the sets affecting each get. That is, for each get, it computes the sets that reach that get - the sets whose assigned value may be read there.

This is similar to SSA computation in a way, but simpler (no need to create phis), and there are fairly fast SSA computations on structured control flow like ours, so it seems like this could be fairly fast. Right now the implementation naively carries around big data structures needlessly.