WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.54k stars 745 forks source link

CoalesceLocals N^2 memory usages of "copies" (affects AddressSanitizer -O3 builds on large wasm file) #3646

Open trzecieu opened 3 years ago

trzecieu commented 3 years ago

Hi,

I've found that wasm2js is getting terminated during execution, After re-compiling wasm2js with address sanitizer suport I've got follwing report. The original wasm was compiled with -O3 -g2 -fsanitize=address (so, wasm with ASAN).

The iteration speed over this issue is pretty slow, getting asan report took 90min of execution, now I'm tryign to get to run with debugger. Wat file is 1GiB size, which makes challenging to process it

While digging through code I've found that this might be already a suspected flaw: https://github.com/WebAssembly/binaryen/blob/760a51bd15a51f02bc1c75087a9cd9e11b9f27bb/src/cfg/liveness-traversal.h#L174-L176

=================================================================
==431669==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f3d3f4fddd9 at pc 0x000000e33332 bp 0x7f40353253b0 sp 0x7f40353253a0
READ of size 1 at 0x7f3d3f4fddd9 thread T5
    #0 0xe33331 in wasm::LivenessWalker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> >::getCopies(unsigned int, unsigned int) (/home/piotr/Projects/binaryen/bin/wasm2js+0xe33331)
    #1 0xe1e0ac in wasm::CoalesceLocals::pickIndicesFromOrder(std::vector<unsigned int, std::allocator<unsigned int> >&, std::vector<unsigned int, std::allocator<unsigned int> >&, unsigned int&) /home/piotr/Projects/binaryen/src/passes/CoalesceLocals.cpp:292
    #2 0xe1ecaf in wasm::CoalesceLocals::pickIndices(std::vector<unsigned int, std::allocator<unsigned int> >&) /home/piotr/Projects/binaryen/src/passes/CoalesceLocals.cpp:331
    #3 0xe1bf28 in wasm::CoalesceLocals::doWalkFunction(wasm::Function*) /home/piotr/Projects/binaryen/src/passes/CoalesceLocals.cpp:102
    #4 0xe5d9e5 in wasm::Walker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> >::walkFunction(wasm::Function*) /home/piotr/Projects/binaryen/src/wasm-traversal.h:558
    #5 0xe5d8ae in wasm::WalkerPass<wasm::LivenessWalker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> > >::runOnFunction(wasm::PassRunner*, wasm::Module*, wasm::Function*) /home/piotr/Projects/binaryen/src/pass.h:338
    #6 0xd76237 in wasm::PassRunner::runPassOnFunction(wasm::Pass*, wasm::Function*) /home/piotr/Projects/binaryen/src/passes/pass.cpp:720
    #7 0xd73dbe in operator() /home/piotr/Projects/binaryen/src/passes/pass.cpp:564
    #8 0xd76e75 in __invoke_impl<wasm::ThreadWorkState, wasm::PassRunner::run()::<lambda()>::<lambda()>&> /usr/include/c++/10/bits/invoke.h:60
    #9 0xd76d16 in __invoke_r<wasm::ThreadWorkState, wasm::PassRunner::run()::<lambda()>::<lambda()>&> /usr/include/c++/10/bits/invoke.h:141
    #10 0xd76a95 in _M_invoke /usr/include/c++/10/bits/std_function.h:291
    #11 0x1664223 in std::function<wasm::ThreadWorkState ()>::operator()() const /usr/include/c++/10/bits/std_function.h:622
    #12 0x166249b in wasm::Thread::mainLoop(void*) /home/piotr/Projects/binaryen/src/support/threads.cpp:86
    #13 0x166872a in void std::__invoke_impl<void, void (*)(void*), wasm::Thread*>(std::__invoke_other, void (*&&)(void*), wasm::Thread*&&) /usr/include/c++/10/bits/invoke.h:60
    #14 0x1668648 in std::__invoke_result<void (*)(void*), wasm::Thread*>::type std::__invoke<void (*)(void*), wasm::Thread*>(void (*&&)(void*), wasm::Thread*&&) /usr/include/c++/10/bits/invoke.h:95
    #15 0x16685b8 in void std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) /usr/include/c++/10/thread:264
    #16 0x1668571 in std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> >::operator()() /usr/include/c++/10/thread:271
    #17 0x1668555 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> > >::_M_run() /usr/include/c++/10/thread:215
    #18 0x7f4042cfaad3  (/lib64/libstdc++.so.6+0xd8ad3)
    #19 0x7f4042aa9431 in start_thread (/lib64/libpthread.so.0+0x9431)
    #20 0x7f40429d76d2 in __clone (/lib64/libc.so.6+0x1016d2)

0x7f3d3f4fddd9 is located 0 bytes to the right of 903644633-byte region [0x7f3d09735800,0x7f3d3f4fddd9)
allocated by thread T5 here:
    #0 0x7f4042ec4067 in operator new(unsigned long) (/lib64/libasan.so.6+0xb2067)
    #1 0xe46b80 in __gnu_cxx::new_allocator<unsigned char>::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0xe42cf4 in std::allocator_traits<std::allocator<unsigned char> >::allocate(std::allocator<unsigned char>&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0xe3eaa5 in std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long) /usr/include/c++/10/bits/stl_vector.h:346
    #4 0xe3950d in std::vector<unsigned char, std::allocator<unsigned char> >::_M_default_append(unsigned long) (/home/piotr/Projects/binaryen/bin/wasm2js+0xe3950d)
    #5 0xe32d4c in std::vector<unsigned char, std::allocator<unsigned char> >::resize(unsigned long) /usr/include/c++/10/bits/stl_vector.h:940
    #6 0xe311ff in wasm::LivenessWalker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> >::doWalkFunction(wasm::Function*) (/home/piotr/Projects/binaryen/bin/wasm2js+0xe311ff)
    #7 0xe1be96 in wasm::CoalesceLocals::doWalkFunction(wasm::Function*) /home/piotr/Projects/binaryen/src/passes/CoalesceLocals.cpp:95
    #8 0xe5d9e5 in wasm::Walker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> >::walkFunction(wasm::Function*) /home/piotr/Projects/binaryen/src/wasm-traversal.h:558
    #9 0xe5d8ae in wasm::WalkerPass<wasm::LivenessWalker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> > >::runOnFunction(wasm::PassRunner*, wasm::Module*, wasm::Function*) /home/piotr/Projects/binaryen/src/pass.h:338
    #10 0xd76237 in wasm::PassRunner::runPassOnFunction(wasm::Pass*, wasm::Function*) /home/piotr/Projects/binaryen/src/passes/pass.cpp:720
    #11 0xd73dbe in operator() /home/piotr/Projects/binaryen/src/passes/pass.cpp:564
    #12 0xd76e75 in __invoke_impl<wasm::ThreadWorkState, wasm::PassRunner::run()::<lambda()>::<lambda()>&> /usr/include/c++/10/bits/invoke.h:60
    #13 0xd76d16 in __invoke_r<wasm::ThreadWorkState, wasm::PassRunner::run()::<lambda()>::<lambda()>&> /usr/include/c++/10/bits/invoke.h:141
    #14 0xd76a95 in _M_invoke /usr/include/c++/10/bits/std_function.h:291
    #15 0x1664223 in std::function<wasm::ThreadWorkState ()>::operator()() const /usr/include/c++/10/bits/std_function.h:622
    #16 0x166249b in wasm::Thread::mainLoop(void*) /home/piotr/Projects/binaryen/src/support/threads.cpp:86
    #17 0x166872a in void std::__invoke_impl<void, void (*)(void*), wasm::Thread*>(std::__invoke_other, void (*&&)(void*), wasm::Thread*&&) /usr/include/c++/10/bits/invoke.h:60
    #18 0x1668648 in std::__invoke_result<void (*)(void*), wasm::Thread*>::type std::__invoke<void (*)(void*), wasm::Thread*>(void (*&&)(void*), wasm::Thread*&&) /usr/include/c++/10/bits/invoke.h:95
    #19 0x16685b8 in void std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) /usr/include/c++/10/thread:264
    #20 0x1668571 in std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> >::operator()() /usr/include/c++/10/thread:271
    #21 0x1668555 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(void*), wasm::Thread*> > >::_M_run() /usr/include/c++/10/thread:215
    #22 0x7f4042cfaad3  (/lib64/libstdc++.so.6+0xd8ad3)

Thread T5 created by T0 here:
    #0 0x7f4042e69be5 in __interceptor_pthread_create (/lib64/libasan.so.6+0x57be5)
    #1 0x7f4042cfad98 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0xd8d98)
    #2 0x1663fdc in std::unique_ptr<std::thread, std::default_delete<std::thread> > wasm::make_unique<std::thread, void (&)(void*), wasm::Thread*>(void (&)(void*), wasm::Thread*&&) /home/piotr/Projects/binaryen/src/support/utilities.h:57
    #3 0x1662014 in wasm::Thread::Thread(wasm::ThreadPool*) /home/piotr/Projects/binaryen/src/support/threads.cpp:52
    #4 0x166440b in std::unique_ptr<wasm::Thread, std::default_delete<wasm::Thread> > wasm::make_unique<wasm::Thread, wasm::ThreadPool*>(wasm::ThreadPool*&&) /home/piotr/Projects/binaryen/src/support/utilities.h:57
    #5 0x1662920 in wasm::ThreadPool::initialize(unsigned long) /home/piotr/Projects/binaryen/src/support/threads.cpp:127
    #6 0x1662ec7 in wasm::ThreadPool::get() /home/piotr/Projects/binaryen/src/support/threads.cpp:160
    #7 0xd73fc8 in operator() /home/piotr/Projects/binaryen/src/passes/pass.cpp:548
    #8 0xd7575c in wasm::PassRunner::run() /home/piotr/Projects/binaryen/src/passes/pass.cpp:585
    #9 0x158f05a in wasm::WalkerPass<wasm::PostWalker<wasm::FunctionValidator, wasm::Visitor<wasm::FunctionValidator, void> > >::run(wasm::PassRunner*, wasm::Module*) /home/piotr/Projects/binaryen/src/pass.h:325
    #10 0x157d59f in wasm::WasmValidator::validate(wasm::Module&, unsigned int) /home/piotr/Projects/binaryen/src/wasm/wasm-validator.cpp:2145
    #11 0xc817ea in main /home/piotr/Projects/binaryen/src/tools/wasm2js.cpp:912
    #12 0x7f40428fd081 in __libc_start_main (/lib64/libc.so.6+0x27081)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/home/piotr/Projects/binaryen/bin/wasm2js+0xe33331) in wasm::LivenessWalker<wasm::CoalesceLocals, wasm::Visitor<wasm::CoalesceLocals, void> >::getCopies(unsigned int, unsigned int)
Shadow bytes around the buggy address:
  0x0fe827e97b60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe827e97b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe827e97b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe827e97b90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0fe827e97ba0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0fe827e97bb0: 00 00 00 00 00 00 00 00 00 00 00[01]fa fa fa fa
  0x0fe827e97bc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe827e97bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe827e97be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe827e97bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0fe827e97c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==431669==ABORTING
kripken commented 3 years ago

This isn't a known issue. It should fail to allocate (and halt), or it should hit the limit in the code you quote and avoid running the analysis. So that it runs it and segfaults suggest were are missing a check somehow. But reading the code for copies it is allocated properly AFAICT.

Do these assertions catch anything?

diff --git a/src/cfg/liveness-traversal.h b/src/cfg/liveness-traversal.h
index 6deab2fd6..7eec02682 100644
--- a/src/cfg/liveness-traversal.h
+++ b/src/cfg/liveness-traversal.h
@@ -277,6 +277,7 @@ struct LivenessWalker : public CFGWalker<SubType, VisitorType, Liveness> {
   }

   void addCopy(Index i, Index j) {
+    assert(i < numLocals && j < numLocals);
     auto k = std::min(i, j) * numLocals + std::max(i, j);
     copies[k] = std::min(copies[k], uint8_t(254)) + 1;
     totalCopies[i]++;
@@ -284,6 +285,7 @@ struct LivenessWalker : public CFGWalker<SubType, VisitorType, Liveness> {
   }

   uint8_t getCopies(Index i, Index j) {
+    assert(i < numLocals && j < numLocals);
     return copies[std::min(i, j) * numLocals + std::max(i, j)];
   }
 };

Looking at the line numbers in the traces, they don't match anything reasonable in current main or in the hash mentioned in the link. Which specific version was this compiled from?

trzecieu commented 3 years ago

This time when I've tested modified binaryen I've got segmentation fault problem. Interestingly, wasm2js didn't fail on my other machine, that gave me a point of view that hardware limitation could be the problem. And indeed. wasm2js run out of RAM (probably not memory leak, but growing vector totalCopies).

While tracking memory consumption on another configuration I've found that processing 90MiB wasm file requires 40GiB of RAM in peak, which exceeds the original machine system memory (32GiB) and crash takes a place when last measurement point was 28GiB consumed by wasm2js.

Here is a chart of memory usage over time for a system with 64GiB of RAM image

kripken commented 3 years ago

Ok, definitely sounds like running out of memory then. I'm surprised it doesn't report an error, but perhaps while things run in parallel it runs out of memory in another pass and hits an unhandled code path somehow.

We should probably reduce the limit here: https://github.com/WebAssembly/binaryen/blob/760a51bd15a51f02bc1c75087a9cd9e11b9f27bb/src/cfg/liveness-traversal.h#L179-L182

Right now that stops at 2^16 locals which means a matrix of size 2^32 of 1 byte, so 2GB. When running in parallel we can use 2GB per core, and so if you have 20 cores you can each 40GB as in that screenshot.

To confirm that's the issue, does running with say BINARYEN_CORES=4 in the environment fix things? (then it should use just 4 cores, and less than 8GB)

trzecieu commented 3 years ago

Hi, the original test was performed on machine with 4 cores with HT, so 8 logical cores. When I limited execution with BINARYEN_CORES=4 and made sure that only 4 workers are spawned I've hit the memory limit. A little bit later tho. In general it looks like one particular section of code is problematic to wasm2js to handle, as memory spikes pretty drastically there: image

The build is pretty unusual, and I suspect that ASAN puts a number of locals to the wasm binary, that later wasm2js struggles to handle. I will try to process *.wat file to see some outstanding functions. Is there something that I should look closely in particular?

kripken commented 3 years ago

Hmm, it may be due to flattening, which increases the number of locals. Looking at the local numbers might be interesting.

Running with BINARYEN_PASS_DEBUG=1 in the env might help narrow down where the oom happens too.

trzecieu commented 3 years ago

It took significantly more time now (4h30m) and resulted also with termination plus following log:

[PassRunner] running passes...
[PassRunner]   running pass: autodrop...                          46.8855 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: legalize-js-interface...             1.79304 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-non-js-ops...                 9.32078 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: flatten...                           242.145 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: i64-to-i32-lowering...               247.078 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: alignment-lowering...                25.0054 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nonesting...         571.282 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute-propagate...              1598.1 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: avoid-reinterprets...                684.118 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination...    31.4696 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                       772.552 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                               149.285 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...                 196.556 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...               22.9558 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...             69.3187 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                   47.6139 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                        52.1255 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                      95.9126 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...       578.949 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                            201.927 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                    404.637 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...                 65.9696 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                   283.747 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                   297.147 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                            98.7764 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                    25.6809 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                   145.851 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                    25.5365 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                            82.2849 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                      64.0878 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                      78.4132 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...                 59.0452 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...               10.5026 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                      52.6485 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                        24.882 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...             48.116 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                               74.6309 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                            81.3507 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                    19.1415 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...               16.7154 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination...    23.8512 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...      0.0578341 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...       6.84558 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...     21.3974 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                    0.0418483 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                         0.0561223 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...                 5.07717 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...                 303.577 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: avoid-reinterprets...                193.407 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: flatten...                           227.296 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-notee-nostructure... 452.76 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...               20.9014 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                      % 
kripken commented 3 years ago

I suspect it is much slower in that mode to much heavier validation. --no-validation might make it a lot faster while still showing the bug, assuming the bug is not related to validation.

Interesting, so it fails in merge-blocks? Is that consistent each time you run it, or maybe it is random?

trzecieu commented 3 years ago

Looks random, this time:

[PassRunner] running passes...
[PassRunner]   running pass: autodrop...                          46.0485 seconds.
[PassRunner]   running pass: legalize-js-interface...             2.28625 seconds.
[PassRunner]   running pass: remove-non-js-ops...                 9.87816 seconds.
[PassRunner]   running pass: flatten...                           259.985 seconds.
[PassRunner]   running pass: i64-to-i32-lowering...               247.905 seconds.
[PassRunner]   running pass: alignment-lowering...                24.9692 seconds.
[PassRunner]   running pass: simplify-locals-nonesting...         567.913 seconds.
[PassRunner]   running pass: precompute-propagate...              1555.96 seconds.
[PassRunner]   running pass: avoid-reinterprets...                712.264 seconds.
[PassRunner]   running pass: duplicate-function-elimination...    31.4762 seconds.
[PassRunner]   running pass: ssa-nomerge...                       771.898 seconds.
[PassRunner]   running pass: dce...                               149.456 seconds.
[PassRunner]   running pass: remove-unused-brs...                 185.211 seconds.
[PassRunner]   running pass: remove-unused-names...               22.9299 seconds.
[PassRunner]   running pass: optimize-instructions...             69.0268 seconds.
[PassRunner]   running pass: pick-load-signs...                   47.5824 seconds.
[PassRunner]   running pass: precompute...                        51.9369 seconds.
[PassRunner]   running pass: code-pushing...                      95.9054 seconds.
[PassRunner]   running pass: simplify-locals-nostructure...       576.945 seconds.
[PassRunner]   running pass: vacuum...                            201.944 seconds.
[PassRunner]   running pass: reorder-locals...                    404.3 seconds.
[PassRunner]   running pass: remove-unused-brs...                 65.8322 seconds.
[PassRunner]   running pass: coalesce-locals...                   255.804 seconds.
[PassRunner]   running pass: simplify-locals...                   304.56 seconds.
[PassRunner]   running pass: vacuum...                            113.623 seconds.
[PassRunner]   running pass: reorder-locals...                    28.1115 seconds.
[PassRunner]   running pass: coalesce-locals...                   151.807 seconds.
[PassRunner]   running pass: reorder-locals...                    25.449 seconds.
[PassRunner]   running pass: vacuum...                            82.1876 seconds.
[PassRunner]   running pass: code-folding...                      63.9698 seconds.
[PassRunner]   running pass: merge-blocks...                      79.9715 seconds.
[PassRunner]   running pass: remove-unused-brs...                 58.7302 seconds.
[PassRunner]   running pass: remove-unused-names...               10.4656 seconds.
[PassRunner]   running pass: merge-blocks...                      52.3901 seconds.
[PassRunner]   running pass: precompute...                        24.8123 seconds.
[PassRunner]   running pass: optimize-instructions...             47.8441 seconds.
[PassRunner]   running pass: rse...                               74.478 seconds.
[PassRunner]   running pass: vacuum...                            81.0584 seconds.
[PassRunner]   running pass: dae-optimizing...                    19.1527 seconds.
[PassRunner]   running pass: inlining-optimizing...               16.7058 seconds.
[PassRunner]   running pass: duplicate-function-elimination...    23.8856 seconds.
[PassRunner]   running pass: duplicate-import-elimination...      0.0457912 seconds.
[PassRunner]   running pass: simplify-globals-optimizing...       6.81031 seconds.
[PassRunner]   running pass: remove-unused-module-elements...     21.4296 seconds.
[PassRunner]   running pass: memory-packing...                    0.0207571 seconds.
[PassRunner]   running pass: directize...                         0.0281063 seconds.
[PassRunner]   running pass: generate-stack-ir...                 4.5748 seconds.
[PassRunner]   running pass: optimize-stack-ir...                 293.079 seconds.
[PassRunner]   running pass: avoid-reinterprets...                193.185 seconds.
[PassRunner]   running pass: flatten...                           230.451 seconds.
[PassRunner]   running pass: simplify-locals-notee-nostructure... 452.479 seconds.
[PassRunner]   running pass: remove-unused-names...               20.5806 seconds.
[PassRunner]   running pass: merge-blocks...                      232.503 seconds.
[PassRunner]   running pass: coalesce-locals...                   % 
kripken commented 3 years ago

Sounds like a general memory usage issue then. In general I'm not sure what we can do, as this combination of options plus your input seems to just create incredibly large output. The only thing might be to add lower limits for certain things, but it's hard to decide where to put those limits.

Note that -O3 is a very high optimization level. Does this work with -O1 perhaps? That gets most of the optimizations with a lot less work.

trzecieu commented 3 years ago

Yes, this is not a blocker issue, more like problem exploration. The build is indeed not common -O3 -g2 -fsanitize=address`, and making optimization level smaller or remove ASAN reduces memory consumption.

At some moment there is a code that resizes a copies vector by the numLocals^2 - this is the place that potentially is problematic. Is there any room to optimize that?

Nonetheless, thank you kindly for helping debugging that. I'm leaving to you whether this issue can be closed or not.

Cheers,

kripken commented 3 years ago

That N^2 matrix could be a sparse matrix. But that would bring its own tradeoffs with slower access times.

The entries there are 8 bit, allowing us to count up to 255 copies between locals. That is probably too high for most use cases. So we might save a factor of 2 or 3 in memory there, perhaps, but not more, so that won't fully solve this.

Or we could limit N to something even lower than 2^16 which we have now. But such arbitrary limits are troubling, as there will be users that would prefer to let their machine with huge memory process their huge project. And letting users apply BINARYEN_CORES to limit memory usage is a workaround for the current situation for people with huge projects but not huge machines (though, it is not obvious...).

Overall I'm not sure how to improve this. Open to ideas!

And thanks for filing @trzecieu , hopefully we can find a good improvement. Leaving open for that.