WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.49k stars 741 forks source link

Platform-specific performance differences when running `wasm-opt`. #4650

Closed MichaelMauderer closed 2 years ago

MichaelMauderer commented 2 years ago

It seems that wasm-opt on my machine has serious differences in performance when running either on Windows 10 or Manjaro.

I'm using wasm-opt through wasm-pack on a non-trivial Rust project. On Windows, the wasm-opt pass takes 133.157 seconds, on the same machine, different partition running Manjaro the same compile takes 728.146 seconds.

This is the output when setting BINARYEN_PASS_DEBUG=1 for Windows

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes...
[PassRunner]   running pass: duplicate-function-elimination... 2.88832 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    4.42434 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.20266 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.79726 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.378617 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          2.15488 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                0.507949 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     1.10151 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   0.621 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    14.536 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.70013 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.83696 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.44841 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                4.3522 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                20.0279 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.56973 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.759905 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                2.93869 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.749045 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.54552 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.3096 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   3.09386 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.6268 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.379605 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   2.76402 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     1.05702 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.8834 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            1.87566 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.70798 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 4.21961 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            29.0754 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 0.859034 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.0019804 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    0.438612 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  0.50541 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.007581 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.0013951 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.225153 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              8.58397 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 133.157 seconds.
[PassRunner] (final validation)

and on Manjaro

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes...
[PassRunner]   running pass: duplicate-function-elimination... 112.779 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    5.35414 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.2106 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.45521 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.324388 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.67558 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                0.392611 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.992838 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   0.482472 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    3.42772 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.13702 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.992806 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.19007 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.65649 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                4.37027 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.10852 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.548347 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.24067 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.544719 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.08809 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.35384 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.68522 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.43297 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.36459 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.48835 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.980593 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.55297 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            2.48903 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.25443 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 120.558 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            382.379 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 44.4687 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.0203127 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    5.31954 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  0.476476 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.028793 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.0149405 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.240264 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              9.0649 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 728.146 seconds.
[PassRunner] (final validation)

The specs of the machine used: Processor AMD Ryzen Threadripper 2950X 16-Core Processor 3.50 GHz Installed RAM 64.0 GB (63.9 GB usable)

Is there a known reason for this performance difference? My preferred work environment is Manjaro, but the time it takes to run wasm-opt there is killing my workflow, and I do need to work with the fully optimized output at the moment.

kripken commented 2 years ago

Wow, that orders of magnitude difference (3 to 112 seconds) on duplicate-function-elimination is very shocking... at least it should be easy to debug since it's so big (and also convenient it is the first pass to run).

I would try using the linux perf command on it (perf record wasm-opt [..], stop it after that pass, then perf report). That will hopefully show something useful.

One possible guess is that that pass hashes function contents. Perhaps the hash function in the STL on that machine is very slow, somehow? We do depend on system libraries there, so system differences can matter.

Another thing to check is whether the slow build is a debug build - maybe wasm-pack builds it at -O0 by mistake? But I wouldn't expect that to cause such a big difference on a specific pass and not everything.

Finally, I'd make sure the machine has enough memory, as maybe that pass uses more than others, and it starts to swap. Can try to build with BINARYEN_CORES=1 or 2 or 4, etc., which will limit parallelism but also save memory (if there are more cores).

Relevant to #4165

MichaelMauderer commented 2 years ago

It seems that was-pack ships a wasm-opt without debug symbols. So, not yet anything useful to report from perf until I can test another build of wasm-opt that has debug symbols.

But just setting BINARYEN_CORES=1 makes a big difference:

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes...
[PassRunner]   running pass: duplicate-function-elimination... 17.8452 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    5.18185 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.21731 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.53555 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.349529 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.74303 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                0.409768 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.935715 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   0.516302 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    3.30613 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.22659 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.994874 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.2351 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.84092 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                4.50136 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         1.99572 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.563741 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.5434 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.587189 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         1.97074 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.17574 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.54356 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.37743 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.339822 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.26992 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.919477 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.28952 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            2.12282 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.10949 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 19.0256 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            96.7644 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 6.48776 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.0143792 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    0.841014 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  0.471183 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.0210139 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.0128064 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.235107 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              8.36705 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 201.888 seconds.

vs BINARYEN_CORES=16

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes...
[PassRunner]   running pass: duplicate-function-elimination... 112.663 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    5.85834 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.43297 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.54267 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.365984 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.59643 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                0.41725 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.992736 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   0.493042 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    3.78044 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.45175 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.04343 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.21475 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                4.07825 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                4.50737 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.14361 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.595855 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.85174 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.588754 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.13316 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.28297 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.65314 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.41555 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.373972 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.35225 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.975925 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          1.44795 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            2.71446 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         2.3204 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 121.434 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            374.782 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 45.7073 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.00285077 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    5.20232 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  0.479619 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.00903188 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.00209986 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.269585 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              9.84304 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 726.02 seconds.

Memory does not seem to an issue, though. It never seems to go even beyond 1GB used in either case.

wdanilo commented 2 years ago

An interesting thing is that we are also testing that on macOS and the same code that @MichaelMauderer is posting about is taking 30-45 seconds (!!!) on M1 macOS, which is crazily fast in comparison to Linux and Windows machines.

MichaelMauderer commented 2 years ago

So maybe this is actually related to the version of wasm-opt that is used in wasm-bindgen

Using a newly built version from this repo I see much more reasonable results, where more cores are faster at least. These are both from Manjaro.

BINARYEN_CORES=16

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes
[PassRunner]   running pass: duplicate-function-elimination... 2.15518 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.727007 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: once-reduction...                 0.804615 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    4.66156 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.29155 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.03365 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.44295 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.0565 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          2.56435 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                1.13374 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     2.02423 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   1.22919 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    3.97407 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.37542 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.29458 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.11653 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.59432 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: local-cse...                      2.11877 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                5.19623 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.09495 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.22572 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.22234 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.22349 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.21541 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.33701 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.75992 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.36783 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.12842 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.65416 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     1.98178 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          2.43883 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            2.54077 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.36713 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 4.46666 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            105.261 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 1.15851 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.691016 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    0.913862 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  1.12312 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.755405 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              1.01954 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              10.5984 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 202.34 seconds.
[PassRunner] (final validation)
 11.98%  wasm-opt      libbinaryen.so                       [.] wasm::validateBinaryenIR(wasm::Module&, wasm::ValidationInfo&)::BinaryenIRValidator::visitExpression                                                                                    ◆
   7.90%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<unsigned long, std::pair<unsigned long const, bool>, std::allocator<std::pair<unsigned long const, bool> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std▒
   5.82%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<wasm::Expression*, wasm::Expression*, std::allocator<wasm::Expression*>, std::__detail::_Identity, std::equal_to<wasm::Expression*>, std::hash<wasm::Expression*>, std::▒
   4.16%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::scan                                                         ▒
   2.89%  wasm-opt      libbinaryen.so                       [.] wasm::validateBinaryenIR                                                                                                                                                                ▒
   2.77%  wasm-opt      libc.so.6                            [.] unlink_chunk.constprop.0                                                                                                                                                                ▒
   2.61%  wasm-opt      libc.so.6                            [.] malloc_consolidate                                                                                                                                                                      ▒
   2.30%  wasm-opt      libc.so.6                            [.] _int_malloc                                                                                                                                                                             ▒
   2.10%  wasm-opt      libbinaryen.so                       [.] wasm::Block::finalize                                                                                                                                                                   ▒
   2.10%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<unsigned long, std::pair<unsigned long const, unsigned char>, std::allocator<std::pair<unsigned long const, unsigned char> >, std::__detail::_Select1st, std::equal_to<u▒
   1.92%  wasm-opt      libbinaryen.so                       [.] std::vector<wasm::Walker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::Task, std::allocator<wasm::Walker<wasm::BranchUtils::▒
   1.70%  wasm-opt      libbinaryen.so                       [.] wasm::CoalesceLocals::pickIndicesFromOrder                                                                                                                                              ▒
   1.66%  wasm-opt      libc.so.6                            [.] __libc_malloc                                                                                                                                                                           ▒
   1.55%  wasm-opt      libc.so.6                            [.] _int_free                                                                                                                                                                               ▒
   1.45%  wasm-opt      libc.so.6                            [.] __strcmp_avx2                                                                                                                                                                           ▒
   0.93%  wasm-opt      libbinaryen.so                       [.] wasm::LocalGraphInternal::Flower::flow                                                                                                                                                  ▒
   0.93%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::BranchUtils::getBranchTargets(wasm::Expression*)::Scanner, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::getBranchTargets(wasm::Expression*)::Scanner, void> ▒
   0.84%  wasm-opt      libc.so.6                            [.] __libc_free                                                                                                                                                                             ▒
   0.76%  wasm-opt      libstdc++.so.6.0.29                  [.] std::_Rb_tree_insert_and_rebalance                                                                                                                                                      ▒
   0.75%  wasm-opt      libbinaryen.so                       [.] wasm::Walker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::pushTask                                                         ▒
   0.69%  wasm-opt      libbinaryen.so                       [.] wasm::TypeUpdater::visitExpression                                                                                                                                                      ▒
   0.68%  wasm-opt      libbinaryen.so                       [.] wasm::BranchUtils::operateOnScopeNameUses<wasm::BranchUtils::operateOnScopeNameUsesAndSentTypes<wasm::BranchUtils::BranchSeeker::visitExpression(wasm::Expression*)::{lambda(wasm::Name&▒
   0.62%  wasm-opt      libbinaryen.so                       [.] std::__detail::_Map_base<unsigned long, std::pair<unsigned long const, unsigned char>, std::allocator<std::pair<unsigned long const, unsigned char> >, std::__detail::_Select1st, std::e▒
   0.58%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::validateBinaryenIR(wasm::Module&, wasm::ValidationInfo&)::BinaryenIRValidator, wasm::UnifiedExpressionVisitor<wasm::validateBinaryenIR(wasm::Module&, wasm::Valid▒

BINARYEN_CORES=1

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes
[PassRunner]   running pass: duplicate-function-elimination... 5.94185 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.699456 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: once-reduction...                 0.972953 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    4.47582 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            2.26037 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.02758 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.38387 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.06008 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          2.56524 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                1.12496 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     1.9688 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   1.19446 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    3.924 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.23851 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.28146 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.02873 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.53683 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: local-cse...                      2.13155 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                5.09726 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.07781 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.17475 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                2.94784 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 1.1603 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.03909 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   3.04993 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.65812 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.24237 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            1.0186 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.57216 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     1.95945 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          2.35517 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            2.45102 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.11829 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 27.7428 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            314.137 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 2.75518 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.686738 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    2.14635 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  1.08174 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.925092 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.990785 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              10.6958 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 438.9 seconds.
[PassRunner] (final validation)
Overhead  Command       Shared Object                        Symbol
  12.09%  wasm-opt      libbinaryen.so                       [.] wasm::validateBinaryenIR(wasm::Module&, wasm::ValidationInfo&)::BinaryenIRValidator::visitExpression                                                                                    ◆
   7.76%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<unsigned long, std::pair<unsigned long const, bool>, std::allocator<std::pair<unsigned long const, bool> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std▒
   5.87%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<wasm::Expression*, wasm::Expression*, std::allocator<wasm::Expression*>, std::__detail::_Identity, std::equal_to<wasm::Expression*>, std::hash<wasm::Expression*>, std::▒
   4.12%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::scan                                                         ▒
   3.84%  wasm-opt      libc.so.6                            [.] _int_malloc                                                                                                                                                                             ▒
   3.59%  wasm-opt      libc.so.6                            [.] unlink_chunk.constprop.0                                                                                                                                                                ▒
   3.03%  wasm-opt      libc.so.6                            [.] malloc_consolidate                                                                                                                                                                      ▒
   2.99%  wasm-opt      libbinaryen.so                       [.] wasm::validateBinaryenIR                                                                                                                                                                ▒
   2.16%  wasm-opt      libbinaryen.so                       [.] wasm::Block::finalize                                                                                                                                                                   ▒
   2.08%  wasm-opt      libbinaryen.so                       [.] std::_Hashtable<unsigned long, std::pair<unsigned long const, unsigned char>, std::allocator<std::pair<unsigned long const, unsigned char> >, std::__detail::_Select1st, std::equal_to<u▒
   1.92%  wasm-opt      libbinaryen.so                       [.] std::vector<wasm::Walker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::Task, std::allocator<wasm::Walker<wasm::BranchUtils::▒
   1.53%  wasm-opt      libbinaryen.so                       [.] wasm::CoalesceLocals::pickIndicesFromOrder                                                                                                                                              ▒
   1.37%  wasm-opt      libc.so.6                            [.] __strcmp_avx2                                                                                                                                                                           ▒
   1.26%  wasm-opt      libc.so.6                            [.] _int_free                                                                                                                                                                               ▒
   1.16%  wasm-opt      libc.so.6                            [.] __libc_malloc                                                                                                                                                                           ▒
   0.94%  wasm-opt      libbinaryen.so                       [.] wasm::LocalGraphInternal::Flower::flow                                                                                                                                                  ▒
   0.92%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::BranchUtils::getBranchTargets(wasm::Expression*)::Scanner, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::getBranchTargets(wasm::Expression*)::Scanner, void> ▒
   0.90%  wasm-opt      libc.so.6                            [.] __libc_free                                                                                                                                                                             ▒
   0.74%  wasm-opt      libbinaryen.so                       [.] wasm::Walker<wasm::BranchUtils::BranchSeeker, wasm::UnifiedExpressionVisitor<wasm::BranchUtils::BranchSeeker, void> >::pushTask                                                         ▒
   0.70%  wasm-opt      libstdc++.so.6.0.29                  [.] std::_Rb_tree_insert_and_rebalance                                                                                                                                                      ▒
   0.68%  wasm-opt      libbinaryen.so                       [.] wasm::BranchUtils::operateOnScopeNameUses<wasm::BranchUtils::operateOnScopeNameUsesAndSentTypes<wasm::BranchUtils::BranchSeeker::visitExpression(wasm::Expression*)::{lambda(wasm::Name&▒
   0.67%  wasm-opt      libbinaryen.so                       [.] wasm::TypeUpdater::visitExpression                                                                                                                                                      ▒
   0.62%  wasm-opt      libbinaryen.so                       [.] std::__detail::_Map_base<unsigned long, std::pair<unsigned long const, unsigned char>, std::allocator<std::pair<unsigned long const, unsigned char> >, std::__detail::_Select1st, std::e▒
   0.59%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::validateBinaryenIR(wasm::Module&, wasm::ValidationInfo&)::BinaryenIRValidator, wasm::UnifiedExpressionVisitor<wasm::validateBinaryenIR(wasm::Module&, wasm::Valid▒
   0.53%  wasm-opt      libbinaryen.so                       [.] wasm::PostWalker<wasm::EffectAnalyzer::InternalAnalyzer, wasm::OverriddenVisitor<wasm::EffectAnalyzer::InternalAnalyzer, void> >::scan                                                  ▒
kripken commented 2 years ago

Interesting that the wasm-pack build is so slow on duplicate-function-elimination with additional cores. The parallel part there is pretty trivial (no atomics or mutexes or anything - it just does a simple replacement of function names in parallel, using the fact that functions have disjoint data). So I'm not sure what's slow there, but maybe profiling that one pass specifically could show something.

MichaelMauderer commented 2 years ago

Running on Windows with 16 cores and the latest binaries (107).

[INFO]: Optimizing wasm binaries with `wasm-opt`...
[PassRunner] running passes
[PassRunner]   running pass: duplicate-function-elimination... 1.10864 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.0294545 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: once-reduction...                 0.0611963 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: ssa-nomerge...                    3.47568 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            1.92887 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.582926 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.23225 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.469429 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          3.57925 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: pick-load-signs...                0.536011 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     3.40764 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-pushing...                   0.716726 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    15.432 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.31819 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.712375 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              1.74165 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                3.67034 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: local-cse...                      2.65892 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals...                21.2357 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.19059 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.626389 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: coalesce-locals...                2.73532 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: reorder-locals...                 0.625499 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.10039 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: code-folding...                   2.79751 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.66299 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              2.01292 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.443504 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: merge-blocks...                   1.61059 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     3.3332 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          3.17921 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: rse...                            1.62575 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: vacuum...                         3.19819 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dae-optimizing...                 4.84012 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: inlining-optimizing...            213.848 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-function-elimination... 0.37251 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: duplicate-import-elimination...   0.002494 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-globals-optimizing...    0.394977 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-module-elements...  0.471014 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: directize...                      0.041338 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: generate-stack-ir...              0.482444 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-stack-ir...              11.2527 seconds.
[PassRunner]   (validating)
[PassRunner] passes took 328.745 seconds.
[PassRunner] (final validation)

So, it seems the huge difference is gone in the current version.