emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler
Other
25.89k stars 3.32k forks source link

Link time slowdown in an optimized build #21521

Open mere-human opened 8 months ago

mere-human commented 8 months ago

I work on a large project that is built with Emscripten. When upgrading EMSDK from version 3.1.37 to 3.1.50, I noticed an increase in link time for a release build. The slowdown occurs only when building on Windows, but not on Mac. The link time remains relatively the same in version 3.1.54.

Link time for the main large UI target:

OS\EMSDK 3.1.37 3.1.50
Windows 8m 11m
Mac 14m 12m

For another smaller test UI target, the link time between .37 and .50 went from 2:43m to 3:22m on Windows. As it can be seen, the slowdown scales up. However, for small CLI targets such as unit tests, the link time has actually improved. The project contains a lot of unused code that is eliminated during the wasm-opt phase. The amount of code as well as linker flags has remained the same between EMSDK versions.

What could be the reason for this, and are there any ways to solve it? I suspect it could be due to LLVM 17 update that happened in 3.1.50. I can provide more details if needed.

Version of emscripten/emsdk:

emcc -v
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.54 (a95c44ee924d02fa1498f846595485d27c31daa8)
clang version 19.0.0git (https://github.com/llvm/llvm-project e769fb8699e3fa8e40623764f7713bfc783b0330)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: C:\Users\user\emsdk\3.1.54\upstream\bin

Failing command line in full: Link command for the smaller program below.

Full link command and output with -v appended:

Linker args were simplified to exclude unnecessary file names. Linker args are specified through an input file. Args: em++.bat -L Libs -sEXPORTED_FUNCTIONS=_main -sUSE_WEBGL2=1 --bind -sUSE_PTHREADS=1 -sCASE_INSENSITIVE_FS=1 -O2 -sPTHREAD_POOL_SIZE=6 -Wl,--error-limit=0 -sEXPORTED_RUNTIME_METHODS=ccall,cwrap -sPTHREAD_POOL_SIZE_STRICT=2 -sDISABLE_EXCEPTION_CATCHING=0 -sLLD_REPORT_UNDEFINED -sTOTAL_MEMORY=2147418112 --profiling-funcs --bind -sUSE_WEBGL2=1 -sFETCH=1 -sSTACK_SIZE=5MB -sEXPORTED_FUNCTIONS=_main,_htons,_ntohs -sEXPORTED_RUNTIME_METHODS=ccall,cwrap -sPROXY_TO_PTHREAD=1 --pre-js=script1.js --pre-js=script2.js ... -o app.html source1.o lib1.a ... --js-library lib1.js ...

Some of the static libs use "-Wl,--whole-archive" or "-Wl,--no-whole-archive".

Linker output: Output for the smaller program.

Click to expand I modified some file names. There were 4 .o files and 86 .a files. ``` "C:/Users/user/emsdk/3.1.54/upstream/bin\clang.exe" --version cache:INFO: generating system asset: symbol_lists/bfb8fc0f93e874314fc7c836514c2df47f9c2e3c.json... (this will be cached in "C:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\symbol_lists\bfb8fc0f93e874314fc7c836514c2df47f9c2e3c.json" for subsequent builds) "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\src\compiler.mjs C:\Users\user\AppData\Local\Temp\tmp2zp4ahks.json --symbols-only cache:INFO: - ok "C:/Users/user/emsdk/3.1.54/upstream/bin\wasm-ld.exe" -o Release/App.wasm -LRelease/ --error-limit=0 obj/Apps/App/entry.o ... --whole-archive obj/Apps/App/AppCore.a ... --no-whole-archive obj/ThirdParty/zlib/zlib/zlib.a --whole-archive ... -LC:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten C:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten\crtbegin.o C:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten\crt1_proxy_main.o --whole-archive -lfetch-mt -lembind-rtti --no-whole-archive -lGL-mt-webgl2-getprocaddr -lal -lhtml5 -lbulkmemory -lstubs -lnoexit -lc-mt -ldlmalloc-mt -lcompiler_rt-mt -lc++-mt -lc++abi-mt -lsockets-mt -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-cxx-exceptions -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr C:\Users\user\AppData\Local\Temp\[tmpn2_uk2smlibemscripten_js_symbols.so](http://tmpn2_uk2smlibemscripten_js_symbols.so/) --import-memory --shared-memory --export=htons --export=ntohs --export=_emscripten_thread_init --export=_emscripten_thread_exit --export=_emscripten_thread_crashed --export=_emscripten_tls_init --export=pthread_self --export=_embind_initialize_bindings --export=stackAlloc --export=stackSave --export=stackRestore --export=_emscripten_thread_free_data --export=emscripten_main_runtime_thread_id --export=emscripten_main_thread_process_queued_calls --export=_emscripten_run_on_main_thread_js --export=emscripten_stack_set_limits --export=__get_temp_ret --export=__set_temp_ret --export=__cxa_is_pointer_type --export=__cxa_can_catch --export=__cxa_increment_exception_refcount --export=__cxa_decrement_exception_refcount --export=setThrew --export=__cxa_free_exception --export=__wasm_call_ctors --export=_emscripten_thread_init --export=_emscripten_thread_exit --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-if-defined=main --export-if-defined=__main_argc_argv --export-table -z stack-size=5242880 --initial-memory=4294901760 --max-memory=4294901760 --entry=_emscripten_proxy_main --table-base=1 --global-base=1024 "C:/Users/user/emsdk/3.1.54/upstream/bin\llvm-objcopy.exe" Release/App.wasm Release/App.wasm --remove-section=.debug* --remove-section=producers "C:/Users/user/emsdk/3.1.54/upstream\bin\wasm-emscripten-finalize" -g --dyncalls-i64 --pass-arg=legalize-js-interface-exported-helpers Release/App.wasm -o Release/App.wasm --detect-features "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\src\compiler.mjs C:\Users\user\AppData\Local\Temp\tmpeng7appy.json "C:/Users/user/emsdk/3.1.54/upstream\bin\wasm-opt" --strip-target-features --post-emscripten -O2 --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable Release/App.wasm -o Release/App.wasm -g --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext --enable-simd "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\tools\acorn-optimizer.mjs C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\App.js unsignPointers -o C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\App.jso1.js "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\tools\acorn-optimizer.mjs C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\App.jso1.js JSDCE minifyWhitespace -o C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\App.jso2.js "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\tools\preprocessor.mjs C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\settings.js worker.js --expandMacros "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\tools\acorn-optimizer.mjs C:\Users\user\repo\source\out\PROD-WASM-EMSCRIPTEN\Release\Programs\App.worker.js minifyWhitespace "C:/Users/user/emsdk/3.1.54/node/16.20.0_64bit/bin/node.exe" C:\Users\user\emsdk\3.1.54\upstream\emscripten\tools\preprocessor.mjs C:\Users\user\AppData\Local\Temp\emscripten_temp_4rvly0q9\settings.js Shell.html "C:\Users\user\emsdk\3.1.54\upstream\emscripten\node_modules\.bin\html-minifier-terser.cmd" Release/App.html -o Release/App.html --collapse-whitespace --collapse-inline-tag-whitespace --remove-comments --remove-tag-whitespace --sort-attributes --sort-class-name --decode-entities --collapse-boolean-attributes --remove-attribute-quotes --remove-redundant-attributes --remove-script-type-attributes --remove-style-link-type-attributes --use-short-doctype --minify-css true --minify-js true ```

When building with EMCC_DEBUG=1, I can see that most of the link time is spent in wasm-opt & binariyen. Although, without much details. Example log entries (for a smaller program):

profiler:DEBUG: block "wasm_opt" took 173.607 seconds
profiler:DEBUG: block "binaryen" took 174.753 seconds
profiler:DEBUG: block "post link" took 187.336 seconds
emcc:DEBUG: total time: 192.53 seconds
profiler:DEBUG: block "main" took 192.531 seconds
sbc100 commented 8 months ago

Thanks for the report. The best way to proceed with this kind of regression is to try to bisect to smaller range and see if we can narrow it down to a specific change. Can you take a look at the bisect instruction here: https://emscripten.org/docs/contributing/developers_guide.html#bisecting

sbc100 commented 8 months ago

I think the next step is to bisect to find out which binaryen change (or llvm change) triggered your regression.

@tlively and @kripken maybe be able to hazard a guess given to revision range, but bisecting is always extremely useful.

kripken commented 8 months ago

Another profiling tool you can use for Binaryen specifically is pass timings. Putting BINARYEN_PASS_DEBUG=1 in the env will emit stuff like this:

[PassRunner] running passes
[PassRunner]   running pass: duplicate-function-elimination... 0.00613846 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: memory-packing...                 0.0124154 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: dce...                            0.106082 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.0273986 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-brs...              0.111207 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: remove-unused-names...            0.0274284 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: optimize-instructions...          0.129891 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: precompute...                     0.0737234 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: tuple-optimization...             0.000672911 seconds.
[PassRunner]   (validating)
[PassRunner]   running pass: simplify-locals-nostructure...    0.21312 seconds.

That is, it prints the time each pass takes. (It also validates, which makes the build slower, but doesn't interfere with the timing measurement.)

mere-human commented 8 months ago

@sbc100 I think I found a commit using instructions from the doc. I hope I did everything right. It seems to be a part of 3.1.50 version.

git bisect start
# good: [694434b6d47c5f6eff2c8fbd9eeb016c977ae9dc] 3.1.48 LTO RC
...
# bad: [2ce4170cef5ce46f337f9fd907b614a8db772c7d] 3.1.50 LTO RC
...

1485e96d7c269709e25810e57ea368c9af7993d6 is the first bad commit
commit 1485e96d7c269709e25810e57ea368c9af7993d6
Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Date:   Tue Nov 28 23:38:14 2023 +0000

    Roll binaryen from 2ea76680ae72 to dbcac17d645d (1 revision)

    https://chromium.googlesource.com/external/github.com/WebAssembly/binaryen.git/+log/2ea76680ae72..dbcac17d645d

    2023-11-28 azakai@google.com [NFC] Move InstrumentedPass logic out and use it in another place (#6132)

    If this roll has caused a breakage, revert this CL and stop the roller
    using the controls here:
    https://autoroll.skia.org/r/binaryen-emscripten-releases
    Please CC dschuff@google.com,wasm-waterfall@grotations.appspotmail.com on the revert to ensure that a human
    is aware of the problem.

    To report a problem with the AutoRoller itself, please file a bug:
    https://issues.skia.org/issues/new?component=1389291&template=1850622

    Documentation for the AutoRoller is here:
    https://skia.googlesource.com/buildbot/+doc/main/autoroll/README.md

    Tbr: wasm-waterfall@grotations.appspotmail.com
    Change-Id: Ib39a12223ff7d03e0ed5b9da12e39618c2c38e39
    Reviewed-on: https://chromium-review.googlesource.com/c/emscripten-releases/+/5069256
    Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
    Bot-Commit: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>

 DEPS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
mere-human commented 8 months ago

Below are the results for one of the targets (test UI app with EMSDK 3.1.54) where we see the slowdown. wasm-ld command line (modified to exclude project names):

shared:DEBUG: successfully executed C:/Users/user/emsdk/3.1.54/upstream/bin\wasm-ld.exe -o Release/1/App.wasm -LRelease/1 --error-limit=0 obj/Apps/App/App/entry.o obj/obj1.o --whole-archive obj/lib1.a obj/lib2.a obj/lib3.a --no-whole-archive .... --no-whole-archive ../../ThirdParty/Google/Skia/libs/wasm/libskia.a ../../ThirdParty/Google/Skia/libs/wasm/libskparagraph.a ../../ThirdParty/Google/Skia/libs/wasm/libskshaper.a ../../ThirdParty/Google/Skia/libs/wasm/libskunicode.a ../../ThirdParty/Apache/Xerces/lib/wasm/libxerces-c.a ../../ThirdParty/ICU/lib/wasm/libicudata.a ../../ThirdParty/ICU/lib/wasm/libicui18n.a ../../ThirdParty/ICU/lib/wasm/libicuio.a ../../ThirdParty/ICU/lib/wasm/libicutu.a ../../ThirdParty/ICU/lib/wasm/libicuuc.a ../../ThirdParty/Harfbuzz/lib/wasm/libharfbuzz.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_calib3d.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_core.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_dnn.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_features2d.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_flann.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_imgproc.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_objdetect.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_photo.a ../../ThirdParty/OpenCVOrg/OpenCV/lib/wasm/libopencv_video.a -LC:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten C:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten\crtbegin.o C:\Users\user\emsdk\3.1.54\upstream\emscripten\cache\sysroot\lib\wasm32-emscripten\crt1_proxy_main.o --whole-archive -lfetch-mt -lembind-rtti --no-whole-archive -lGL-mt-webgl2-getprocaddr -lal -lhtml5 -lbulkmemory -lstubs -lnoexit -lc-mt -ldlmalloc-mt -lcompiler_rt-mt -lc++-mt -lc++abi-mt -lsockets-mt -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-cxx-exceptions -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr C:\Users\user\AppData\Local\Temp\emscripten_temp\tmpisj41kgwlibemscripten_js_symbols.so --import-memory --shared-memory --export=htons --export=ntohs --export=_emscripten_thread_init --export=_emscripten_thread_exit --export=_emscripten_thread_crashed --export=_emscripten_tls_init --export=pthread_self --export=_embind_initialize_bindings --export=stackAlloc --export=stackSave --export=stackRestore --export=_emscripten_thread_free_data --export=emscripten_main_runtime_thread_id --export=emscripten_main_thread_process_queued_calls --export=_emscripten_run_on_main_thread_js --export=emscripten_stack_set_limits --export=__get_temp_ret --export=__set_temp_ret --export=__cxa_is_pointer_type --export=__cxa_can_catch --export=__cxa_increment_exception_refcount --export=__cxa_decrement_exception_refcount --export=setThrew --export=__cxa_free_exception --export=__wasm_call_ctors --export=_emscripten_thread_init --export=_emscripten_thread_exit --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-if-defined=main --export-if-defined=__main_argc_argv --export-table -z stack-size=5242880 --initial-memory=4294901760 --max-memory=4294901760 --entry=_emscripten_proxy_main --table-base=1 --global-base=1024

4 obj.o files 86 static lib.a files

Binary sizes:

mere-human commented 8 months ago

Below is the linker output with BINARYEN_PASS_DEBUG=1 (same test UI target with EMSDK 3.1.54). Most notable phases are:

``` link:DEBUG: using binaryen [PassRunner] running passes [PassRunner] running pass: strip-target-features... 0.0052219 seconds. [PassRunner] (validating) [PassRunner] running pass: post-emscripten... 1.35448 seconds. [PassRunner] (validating) [PassRunner] passes took 1.3597 seconds. [PassRunner] (final validation) [PassRunner] running passes [PassRunner] running pass: duplicate-function-elimination... 3.9756 seconds. [PassRunner] (validating) [PassRunner] running pass: memory-packing... 1.55431 seconds. [PassRunner] (validating) [PassRunner] running pass: once-reduction... 0.220142 seconds. [PassRunner] (validating) [PassRunner] running pass: dce... 2.88644 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.556645 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 2.45467 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.52959 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-instructions... 3.98953 seconds. [PassRunner] (validating) [PassRunner] running pass: pick-load-signs... 0.591546 seconds. [PassRunner] (validating) [PassRunner] running pass: precompute... 3.50598 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-added-constants... 0.362271 seconds. [PassRunner] (validating) [PassRunner] running pass: code-pushing... 0.778157 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-locals-nostructure... 26.278 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 2.23215 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.677075 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 1.75115 seconds. [PassRunner] (validating) [PassRunner] running pass: coalesce-locals... 3.51701 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-locals... 30.138 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 2.11899 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.663686 seconds. [PassRunner] (validating) [PassRunner] running pass: coalesce-locals... 3.70763 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.658941 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 1.99957 seconds. [PassRunner] (validating) [PassRunner] running pass: merge-blocks... 1.72868 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 1.83788 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.499146 seconds. [PassRunner] (validating) [PassRunner] running pass: merge-blocks... 1.71645 seconds. [PassRunner] (validating) [PassRunner] running pass: precompute... 3.33652 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-instructions... 3.26149 seconds. [PassRunner] (validating) [PassRunner] running pass: rse... 2.59451 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 2.0264 seconds. [PassRunner] (validating) [PassRunner] running pass: dae-optimizing... 67.4479 seconds. [PassRunner] (validating) [PassRunner] running pass: inlining-optimizing... 75.9753 seconds. [PassRunner] (validating) [PassRunner] running pass: duplicate-function-elimination... 1.06997 seconds. [PassRunner] (validating) [PassRunner] running pass: duplicate-import-elimination... 0.0047866 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-globals-optimizing... 0.992887 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-module-elements... 0.867906 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-globals... 0.0035941 seconds. [PassRunner] (validating) [PassRunner] running pass: directize... 0.284006 seconds. [PassRunner] (validating) [PassRunner] running pass: generate-stack-ir... 0.574415 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-stack-ir... 10.1396 seconds. [PassRunner] (validating) [PassRunner] passes took 269.509 seconds. [PassRunner] (final validation) [PassRunner] running passes [PassRunner] passes took 0 seconds. [PassRunner] (final validation) shared:DEBUG: successfully executed C:/Users/user/emsdk/3.1.54/upstream\bin\wasm-opt --strip-target-features --post-emscripten -O2 --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable Release/1/App.wasm -o Release/1/App.wasm -g --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext --enable-simd building:DEBUG: saving debug copy C:\Users\user\AppData\Local\Temp\emscripten_temp\emcc-04-wasm-opt.wasm building:DEBUG: saving debug copy C:\Users\user\AppData\Local\Temp\emscripten_temp\emcc-05-byn.wasm profiler:DEBUG: block "wasm_opt" took 798.307 seconds ```

Comparing this to the output for the same target but with EMSDK 3.1.37, I can see that "dae-optimizing" takes almost x10 longer. "simplify-locals" increased as well.

``` emcc:DEBUG: using binaryen [PassRunner] running passes [PassRunner] running pass: strip-dwarf... 0.0071776 seconds. [PassRunner] (validating) [PassRunner] running pass: post-emscripten... 1.28154 seconds. [PassRunner] (validating) [PassRunner] running pass: duplicate-function-elimination... 3.04294 seconds. [PassRunner] (validating) [PassRunner] running pass: memory-packing... 1.47303 seconds. [PassRunner] (validating) [PassRunner] running pass: once-reduction... 0.190231 seconds. [PassRunner] (validating) [PassRunner] running pass: dce... 2.96166 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.557084 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 2.43214 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.530591 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-instructions... 4.09665 seconds. [PassRunner] (validating) [PassRunner] running pass: pick-load-signs... 0.596229 seconds. [PassRunner] (validating) [PassRunner] running pass: precompute... 3.48007 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-added-constants... 0.368012 seconds. [PassRunner] (validating) [PassRunner] running pass: code-pushing... 0.796706 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-locals-nostructure... 26.6814 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 4.1554 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.668175 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 1.73962 seconds. [PassRunner] (validating) [PassRunner] running pass: coalesce-locals... 4.16059 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-locals... 29.8688 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 4.46619 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.670595 seconds. [PassRunner] (validating) [PassRunner] running pass: coalesce-locals... 3.66475 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-locals... 0.664907 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 4.27954 seconds. [PassRunner] (validating) [PassRunner] running pass: merge-blocks... 1.74198 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-brs... 1.83444 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-names... 0.498485 seconds. [PassRunner] (validating) [PassRunner] running pass: merge-blocks... 1.74595 seconds. [PassRunner] (validating) [PassRunner] running pass: precompute... 3.24462 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-instructions... 3.26838 seconds. [PassRunner] (validating) [PassRunner] running pass: rse... 2.80599 seconds. [PassRunner] (validating) [PassRunner] running pass: vacuum... 3.74706 seconds. [PassRunner] (validating) [PassRunner] running pass: dae-optimizing... 6.93561 seconds. [PassRunner] (validating) [PassRunner] running pass: inlining-optimizing... 94.2066 seconds. [PassRunner] (validating) [PassRunner] running pass: duplicate-function-elimination... 0.922562 seconds. [PassRunner] (validating) [PassRunner] running pass: duplicate-import-elimination... 0.0045007 seconds. [PassRunner] (validating) [PassRunner] running pass: simplify-globals-optimizing... 0.979464 seconds. [PassRunner] (validating) [PassRunner] running pass: remove-unused-module-elements... 0.74016 seconds. [PassRunner] (validating) [PassRunner] running pass: reorder-globals... 0.0047505 seconds. [PassRunner] (validating) [PassRunner] running pass: directize... 0.235321 seconds. [PassRunner] (validating) [PassRunner] running pass: generate-stack-ir... 0.539002 seconds. [PassRunner] (validating) [PassRunner] running pass: optimize-stack-ir... 9.87425 seconds. [PassRunner] (validating) [PassRunner] running pass: strip-debug... 0.42347 seconds. [PassRunner] (validating) [PassRunner] running pass: strip-producers... 0.0037296 seconds. [PassRunner] (validating) [PassRunner] passes took 236.59 seconds. [PassRunner] (final validation) shared:DEBUG: successfully executed C:/Users/user/emsdk/3.1.37/upstream\bin\wasm-opt --strip-dwarf --post-emscripten -O2 --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers Release/1/App.wasm -o Release/1/App.wasm -g --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext --enable-simd building:DEBUG: saving debug copy C:\Users\user\AppData\Local\Temp\emscripten_temp\emcc-4-wasm-opt.wasm building:DEBUG: saving debug copy C:\Users\user\AppData\Local\Temp\emscripten_temp\emcc-5-byn.wasm profiler:DEBUG: block "wasm_opt" took 767.857 seconds ```
kripken commented 8 months ago

@mere-human can you provide the wasm file to reproduce this? We did measurements before landing that, and I did some more measurements now on wasm files I have locally, and can't reproduce the problem. It might be specific to that wasm somehow.

kripken commented 8 months ago

If you can't provide it, what I would do to investigate this on your machine is to run --dae-optimizing --optimize-level=3 on that wasm (that is the main regression here; simplify-locals looks like noise), with some profiling. E.g. on linux I would use perf stat -r 1 for CPU metrics, and perf record / perf report for profiling.

mere-human commented 8 months ago

We did measurements before landing that, and I did some more measurements now on wasm files I have locally, and can't reproduce the problem.

@kripken Thanks, I'll try that. But did you check it on Windows? I see the problem only on Windows but not on Mac (where the build time has actually improved). Also, I'm not sure I can use "perf" on Windows.

mere-human commented 8 months ago

I've just compared .wasm output for each phase with EMCC_DEBUG=1 produced on Mac and on Windows. All the binaries are identical. But I see the slowdown only on Windows.

kripken commented 8 months ago

Oh, sorry, I missed that this was only on Windows. I don't have a Windows machine myself - I tested on Linux.

In that case, maybe indirect calls are slower there for some reason? That is the only thing that PR should change, I think. I'm not sure how the OS can matter for indirect call speed, though I recall Spectre had some mitigations related to that...

Perhaps there is some Windows profiling tool that you can run to get some insights on what is happening here?