emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler

Size regression in wasm builds from 1.39.1-fastcomp to 1.39.1 due to pthreads passive data segments not optimized for zeros #9775

Closed niklasf closed 4 years ago

niklasf commented 4 years ago

There's a huge increase in binary size when building https://github.com/niklasf/stockfish.wasm (v0.5.11) with the latest emscripten.

Is there any information I could provide that would help narrow this down?

1.38.48 (last good):

392K    stockfish.wasm
144K    stockfish.js
 12K    stockfish.worker.js
 40K    stockfish.js.mem
592K    total

1.39.0 (first bad):

1.3M    stockfish.wasm
144K    stockfish.js
 12K    stockfish.worker.js
 40K    stockfish.js.mem
1.5M    total

1.39.1 (still bad).

Flags: https://github.com/niklasf/stockfish.wasm/blob/bb648dd021e3e585def8b86050800f47001524a5/src/Makefile#L243

niklasf commented 4 years ago

Much better with 1.39.1-fastcomp:

132K    stockfish.js
 40K    stockfish.js.mem
392K    stockfish.wasm
 12K    stockfish.worker.js
576K    total
sbc100 commented 4 years ago

Can you post the linker command line options you use?

There are some tools for analyzing the size of WebAssembly binaries, such as https://github.com/rustwasm/twiggy. In this case the size difference is quite significant, so I suspect that something is preventing the GC of dead functions in the LLVM upstream case. Seeing your command line options will help figure it out.


niklasf commented 4 years ago

Thanks. The Makefile linked above runs the following command for the source files:

em++ -Wall -Wcast-qual -std=c++11 -s MODULARIZE=1 -s EXPORT_NAME="Stockfish" -s NO_EXIT_RUNTIME=1 -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=1 -s TOTAL_MEMORY=109051904 -s EXPORTED_FUNCTIONS="['_main', '_uci_command']" -s "EXTRA_EXPORTED_RUNTIME_METHODS=['ccall']" -s "INCOMING_MODULE_JS_API=['print', 'postRun']" --pre-js pre.js -DNDEBUG -O3 -DIS_64BIT -DNO_PREFETCH -msse3 -mpopcnt -DUSE_POPCNT -flto -c -o main.o main.cpp

The final binary is linked with basically the same flags:

em++ -o stockfish.js benchmark.o bitbase.o bitboard.o endgame.o evaluate.o main.o material.o misc.o movegen.o movepick.o pawns.o position.o psqt.o search.o thread.o timeman.o tt.o uci.o ucioption.o -s MODULARIZE=1 -s EXPORT_NAME="Stockfish" -s NO_EXIT_RUNTIME=1 -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=1 -s TOTAL_MEMORY=109051904 -s EXPORTED_FUNCTIONS="['_main', '_uci_command']" -s "EXTRA_EXPORTED_RUNTIME_METHODS=['ccall']" -s "INCOMING_MODULE_JS_API=['print', 'postRun']" --pre-js pre.js -lpthread -Wall -Wcast-qual -std=c++11 -s MODULARIZE=1 -s EXPORT_NAME="Stockfish" -s NO_EXIT_RUNTIME=1 -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=1 -s TOTAL_MEMORY=109051904 -s EXPORTED_FUNCTIONS="['_main', '_uci_command']" -s "EXTRA_EXPORTED_RUNTIME_METHODS=['ccall']" -s "INCOMING_MODULE_JS_API=['print', 'postRun']" --pre-js pre.js -DNDEBUG -O3 -DIS_64BIT -DNO_PREFETCH -msse3 -mpopcnt -DUSE_POPCNT -flto

I'll try twiggy now.

niklasf commented 4 years ago

Unfortunately twiggy panics on the version compiled with the llvm backend:

$ twiggy garbage 1.39.1-llvm-stockfish.wasm 
thread 'main' panicked at 'should not parse the same key into multiple items', /home/niklas/.cargo/registry/src/github.com-1ecc6299db9ec823/twiggy-ir-0.6.0/./ir.rs:53:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Binaries attached, gzipped so they can be uploaded to GitHub:

1.39.1-fastcomp-stockfish.wasm.gz 1.39.1-llvm-stockfish.wasm.gz

kripken commented 4 years ago

Bloaty might be helpful too. Or just running wasm-dis and then emscripten's tools/find_bigfuncs.py.

It's possible the newer llvm in upstream inlines differently, which affects code size, but a difference that big sounds like something else. Should be easy to debug though given it's so big.

How much do things change when removing -flto from all the commands?

niklasf commented 4 years ago

I noticed something else while looking at the gzipped files:

136K    1.39.1-fastcomp-stockfish.wasm.gz
128K    1.39.1-llvm-stockfish.wasm.gz

So the difference goes away after compression. In fact LLVM is now better!

Will run bloaty and try removing -flto soon.

kripken commented 4 years ago

Any chance you can upload builds with names (--profiling-funcs at link)? That would make comparing them easier.

edit: nevermind :) see next comment

kripken commented 4 years ago

Comparing the existing builds, wasm-opt --metrics shows basically the entire difference is

- [memory-data]  : 0       
+ [memory-data]  : 1013251 

So the fastcomp build has no memory in the wasm, but the upstream one does.

I believe that makes sense: in upstream the data is in the wasm, and we use passive segments so that we load them only on the main thread. In fastcomp, we separate the data out and load it once in JS.
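A rough mental model of this active/passive distinction, as a toy Python sketch (the class and method names below are illustrative, not real wasm APIs):

```python
# Toy model of active vs. passive data segments (illustrative only).
# Active segments are copied into linear memory automatically at
# instantiation; passive segments stay in the module until a
# memory.init instruction copies them explicitly, which emscripten's
# startup code runs only on the main thread so that the shared memory
# is initialized exactly once.
class ToyModule:
    def __init__(self, active, passive):
        self.memory = bytearray(64)     # stand-in for linear memory
        self.passive = passive          # {segment_index: bytes}
        for offset, data in active:     # applied at instantiation
            self.memory[offset:offset + len(data)] = data

    def memory_init(self, seg, dest, src, size):
        # Copy `size` bytes of passive segment `seg`, starting at
        # `src`, into linear memory at `dest`.
        data = self.passive[seg]
        self.memory[dest:dest + size] = data[src:src + size]
```

The size consequence follows from this: with passive segments, the data has to ship inside the .wasm itself rather than in a side .mem file.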

What is odd is that you wrote you have a 40K .mem file in both cases. Did the upstream build not emit one, but there was one already in that dir perhaps?

I see there are many, many zeros in the upstream build's data segments. I guess we don't do the usual tricks to skip zeros in that case because the segments are passive, @tlively?

niklasf commented 4 years ago

Thank you very much, this explains the difference! I can confirm that it's just the .mem file from the previous build.

tlively commented 4 years ago

The MemoryPacking pass currently bails out in the presence of memory.init and data.drop instructions, so zeros in non-bss segments will not be eliminated in pthread builds at the moment. This could be fixed by duplicating and updating bulk memory operations when the segments they reference are split or otherwise changed.

I've noted this on my backlog, but I'm not sure how to prioritize it. @kripken how critical do you think this is?
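The zero-skipping trick the pass performs can be sketched as follows (a hypothetical Python model of segment splitting, not Binaryen's actual implementation):

```python
def split_zero_runs(data: bytes, min_gap: int = 8):
    """Return (offset, chunk) segments covering the non-zero content
    of `data`, omitting zero runs of at least `min_gap` bytes (linear
    memory is zero-initialized, so skipped runs cost nothing)."""
    segments = []
    i, n = 0, len(data)
    while i < n:
        if data[i] == 0:
            i += 1          # zeros between segments are simply skipped
            continue
        start = i           # start of a non-zero chunk
        zeros = 0
        while i < n and zeros < min_gap:
            zeros = zeros + 1 if data[i] == 0 else 0
            i += 1
        end = i - zeros if zeros >= min_gap else i
        while end > start and data[end - 1] == 0:
            end -= 1        # drop any trailing zeros from the chunk
        segments.append((start, data[start:end]))
        i = end
    return segments
```

For active segments each chunk gets its own explicit offset, so the zeros never appear in the binary; the complication with passive segments is that every memory.init and data.drop referring to the original segment would have to be rewritten in terms of the new, split segments.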

kripken commented 4 years ago

Good question. If this affected non-pthreads builds too, I'd be much more concerned. As it's pthreads-only, it's still important, but less so, I guess. I've put it in the middle column on https://github.com/emscripten-core/emscripten/projects/1 now, fwiw.

xiyuyizhi commented 4 years ago

Same problem: when switching from 1.39.0-fastcomp to 1.39.0-upstream, the wasm size (ffmpeg libx) grows from 1.1M to 2.1M. This build also uses threads.

tlively commented 4 years ago

The fix for this has been merged upstream in https://github.com/WebAssembly/binaryen/pull/2426 and should make its way into the next release of Emscripten (1.39.7).

kleisauke commented 4 years ago

fwiw, on a large project (with pthreads) this reduced my .wasm from 4.66 MB to 2.89 MB (!).

But running this in the browser results in an error:

CompileError: WebAssembly.instantiate(): Compiling function #5031 failed: invalid data segment index: 4294967232 @+1726667

Has this error already been reported? If not, I'm happy to open a new issue with the relevant .wasm from v1.39.6 and master (I could also provide a minimal test case, if you want, but that would take a bit longer).

I'm testing with sdk-upstream-master-64bit and binaryen-master-64bit from emsdk. v1.39.6 works flawlessly.

kleisauke commented 4 years ago

It seems that the above-mentioned issue only affects Chrome. On Firefox Nightly the problem does not occur. Here's a minimal reproduction: https://kleisauke.nl/issue-9775/.

The problem occurs in __wasm_init_memory (func #0). The data-segment index 4294967232 is close to UINT32_MAX, so it is probably a problem in Chrome's bulk memory implementation.
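A quick arithmetic check supports that guess: the bogus index is exactly -64 reinterpreted as an unsigned 32-bit value, which is what a decoder would produce if it sign-extended segment index 64.

```python
# 4294967232 is not just "close to" UINT32_MAX; it is exactly the
# unsigned 32-bit reinterpretation of -64, i.e. a sign-extended
# decode of segment index 64.
assert (1 << 32) - 64 == 4294967232
assert -64 & 0xFFFFFFFF == 4294967232
```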

Does anyone know where I can report such things? I could submit a comment to https://bugs.chromium.org/p/v8/issues/detail?id=7747, but I don't know if this is appropriate.

binji commented 4 years ago

Thanks for reporting! It looks like Chrome is erroneously decoding the data segment index as signed instead of unsigned. I'm working on a fix now.

sbc100 commented 4 years ago

@tlively it looks like a v8 issue that only occurs with more than 63 passive data segments. Is there some flag that lets users stop binaryen from generating so many segments (at least until that fix makes its way into Chrome)? (If not, maybe we should add one?)

tlively commented 4 years ago

Wow, that's unfortunate. There's logic in the pass to prevent the number of segments from surpassing the web limit of 100,000 segments, so I guess we can just change the limit in that logic.

binji commented 4 years ago

As an (admittedly horrible) workaround, since the bug is in decoding a signed LEB, you could avoid any segment index that has the bit pattern 01bb_bbbb (and 1bbb_bbbb 01bb_bbbb, etc.), since these will all be sign-extended. This means that 0..63, 128..8191, and 16384..1048575 are OK.
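That bit pattern can be checked mechanically. A small Python sketch (the helper names are hypothetical) encodes an index as unsigned LEB128 and tests bit 6 of the final byte, which is exactly the bit a signed decoder treats as the sign:

```python
def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def misread_as_signed(index: int) -> bool:
    """True if a buggy signed-LEB decode of this unsigned index would
    sign-extend it: bit 6 of the final byte acts as the sign bit."""
    return bool(uleb128(index)[-1] & 0x40)
```

Running this over the boundaries reproduces the safe ranges above: 63 is fine, 64..127 are misread, 128..8191 are fine again, and so on.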

tlively commented 4 years ago

I could do that, but I really would rather not :grimacing: @sbc100 @kripken do either of you think this is important enough for that hack?

tlively commented 4 years ago

https://github.com/WebAssembly/binaryen/pull/2613

binji commented 4 years ago

I filed a v8 bug here: https://bugs.chromium.org/p/v8/issues/detail?id=10151

kripken commented 4 years ago

I didn't understand the last suggestion that the question was about: what does "avoid" refer to, and who would be doing that (something in Binaryen, at runtime, or elsewhere)? Is that PR not sufficient?

binji commented 4 years ago

There are certain segment indexes that are interpreted incorrectly by v8, because they are decoded as signed numbers instead of unsigned numbers. 0 through 63 are OK, since the representation is the same for signed and unsigned LEB. But 64 through 127 are not, since 64 is decoded as -64 when interpreted as a signed LEB.

So in Binaryen, for this pass, you could potentially add empty data segments at the indexes that would be misinterpreted, and never use them with memory.init or data.drop instructions. This would allow you to raise the limit above 63 and still work with v8 (I believe).
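That padding idea can be sketched as follows (hypothetical helper, not Binaryen code): walk the real segments and insert empty placeholders whenever the next index is one a signed-LEB decoder would misread.

```python
def pad_segments(segments):
    """Interleave empty placeholder segments so that every real
    segment lands on an index whose final unsigned-LEB byte has bit 6
    clear, i.e. an index the buggy signed decoder still reads
    correctly. The placeholders are never referenced by memory.init
    or data.drop."""
    def misread(i):
        while i > 0x7F:          # reduce to the final LEB byte's bits
            i >>= 7
        return bool(i & 0x40)    # bit 6 set -> sign-extended

    out = []
    for seg in segments:
        while misread(len(out)):
            out.append(b"")      # dummy segment at a "bad" index
        out.append(seg)
    return out
```

Note the cost: indexes 64..127 are all misread, so the first time the workaround kicks in it has to insert 64 dummy segments in a row before real data can continue at index 128.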

kripken commented 4 years ago

I see, thanks! Yeah, that sounds like it would work too.