duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.19k stars 125 forks source link

Reduce main module size #1574

Closed Y-- closed 8 months ago

Y-- commented 9 months ago

Note: This PR is more a PoC to start a conversation: it is not in a mergeable state.

Today, when building DuckDB WASM with the "loadable extension" flag, the bundles are quite big because when building with MAIN_MODULE=1, Emscripten exports all absolutely the symbols:

Bundle Uncompressed Brotli GZip
duckdb-eh.js 45M 1.4M 2.0M
duckdb-eh.wasm 32M 4.1M 6.2M
This PR is an attempt to reduce these. Here are the size when built with this PR: Bundle Uncompressed Brotli GZip
duckdb-eh.js 2.7M 193K 250K
duckdb-eh.wasm 19M 2.7M 4.2M

I was able to load all the in-tree DuckDB extensions, as well as ours with this build.

Of course, this is not that easy, here are the caveats:

Happy to hear your thoughts and whether it is a road worth pursuing, or if you have better ideas on how to solve this?

carlopi commented 9 months ago

Hi! Thanks a lot for bringing this up, and the exploration you did.

I think this is a bit too brittle, but if needed this can be considered as a bridge (potentially we might need a different list for each bundle, since functions might end up having different signatures, plus there are ad hoc functions like the personality function that do are missing if exceptions are not there, see https://github.com/duckdb/duckdb-wasm/actions/runs/7453872285/job/20280188337?pr=1574#step:7:5872).

The problem is partially that the functions are exported, but mainly the fact that for each exported functions it's name (as a string) it's repeated multiple times: 1 time in the wasm file, see the output of wasm2wat duckdb-eh.wasm --enable-all | grep export:

  (export "_ZNSt3__218__allocation_guardINS_9allocatorINS_20__shared_ptr_emplaceIN5arrow12NumericArrayINS3_10Date64TypeEEENS1_IS6_EEEEEEEC2B8ue170004IS7_EET_m" (func 185))
  (export "_ZNSt3__220__shared_ptr_emplaceIN5arrow12NumericArrayINS1_10Date64TypeEEENS_9allocatorIS4_EEEC2B8ue170004IJxNS_10shared_ptrINS1_6BufferEEERKSB_xxEEES6_DpOT_" (func 193))

that in my versions generates 80000+ lines. Then each symbol (the actual string, that might look like __ZN5arrow7struct_ERKNSt3__26vectorINS0_10shared_ptrINS_5FieldEEENS0_9allocatorIS4_EEEE, is repeated 5 more times in the connected JavaScript file. (EDIT: 5 times is for functions, there are also globals in the exports, will need to check better but I would assume a somewhat similar number)

Here I think there are few possible solutions:

Main problem of this remapping is that the original names still needs to be stored somewhere. Simple solution would be storing the mapping as part of the JavaScript code (this reduces savings from 90%+ to 75% on JS side, and keeps them Wasm side). This is probably the simpler version I would like to start experimenting with. Somehow more involved on the infrastructure side is keeping the mapping elsewhere (basically a property of duckdb), and have this mapping be used also to produce extensions, so the mapping becomes needed only at build-time (either for duckdb-wasm or for extensions).

domoritz commented 9 months ago

As you look at file sizes, please also measure gzip and brotli compressed sizes. We should track that as you make progress.

carlopi commented 8 months ago

Actually I think solution 1 (= iterating on available exports) should be doable, for now I implemented only a prototype but it seems that the whole overhead on JS files can be avoided (adding a post-processing step).

Wasm side the exports are still somehow pricey, so maybe both solution can be combined.

I have not explored changing optimization level, also that can be done independenlty.

Y-- commented 8 months ago

Thanks for the feedback!

Agree that solution 1 sounds the most promising (and the cleanest one). I actually tried to hack something quickly before the holiday break but couldn't come up with something obvious. I agree that this seems very doable indeed, happy to help on that front.

Solution 2 is basically what closure=1 is supposed to do, and while it works very well on the small examples I tried, I faced some errors with duckdb-wasm (though didn't investigate much time looking what the problem was exactly). In any case I think we should use it once we have exhausted solution 1.

And finally completely agree with @domoritz that we should also track the gzipped and brotli-compressed size. I actually almost provided the brotli size in the PR description but felt that while important, it is not the main point: these symbols are still an overhead for the browser even if bundle is compressed. I'll edit the description with them though :-)

@carlopi, let me know how I can best help you with solution 1.

carlopi commented 8 months ago

I found a solution to the size problem, I will open a PR soon (EDIT: it's here https://github.com/duckdb/duckdb-wasm/pull/1589), basics are that JavaScript size we don't need to expose any C++ side functions (apart for the few duckdb_web ones and some utilities listed as EXPORTED_FUNCTIONS).

I have not find a polished way to do this, butstripping them via awk from the JS side will be enough for now.

This solutions allows to reduce the size of the uncompressed duckdb-eh.js file to 2.0M, while the wasm files will not be modified (since the overhead on listing the exports will still be there).


The general problem is that Emscripten seems not to have (or at least, not one that I have found) a way to specify that there are two level of interfaces: symbols that needs to be externally available at the C++ side to other modules to be dlopened, and symbols that needs to be exposed to the JavaScript interface (for functions that needs to be there on the JS side).

I hope a solution can be implemented properly in Emscripten.


This PR is still interesting in the blue-print to select explicitly which functions are to be exported (C++ side), that allows to reduce the weight of the wasm side, so it might make sense to have both since they are independent, I will come back to this after merging the JS-side PR.

Y-- commented 8 months ago

Closing since https://github.com/duckdb/duckdb-wasm/pull/1589 is a much better approach.