emscripten-core / emsdk

Emscripten SDK
http://emscripten.org

Rationale behind using bzip2 for wasm-binaries.tbz2 #1235

Closed kspalaiologos closed 1 year ago

kspalaiologos commented 1 year ago

Installing Emscripten for the first time on my machine takes approximately 1min 43.79s wall clock time. 1 min 29.44s out of this figure is spent in bzip2 -d decompressing the wasm-binaries.tbz2 archive, hence my question: why bzip2?

BWT codecs are not a good choice for the kind of data contained in the archive. I have run some tests with BWT codecs that outperform bzip2, such as bzip3, which yields an archive about 14% smaller, but this is irrelevant because the total time spent in bzip3 (-dj8) is still significant: 37.419s. BWT codecs tend to be symmetric, either because of the SACA algorithm or the entropy coding stage. Further, they do not provide any preprocessing capabilities for executables contained within the archive.

As such, I have tested a few LZ codecs. The archive produced by zstd -9k lies between bz2 and bz3 at around 330'331'630 bytes, but it is 25 times faster to decompress than bzip2 and 9 times faster to decompress than bzip3, hence using zstandard instead of bzip2 would improve the installation time from 1min 43s to 14s.

bzip3 and zstandard are admittedly still uncommon on Linux machines, but the rather ubiquitous lzma provides an even better ratio, albeit considerably slower. I verified this using lzma -9k and then lzma -df: the archive comes to 207'465'837 bytes, almost halving the distribution size (thanks to LZMA's executable code preprocessors, among others), with a decompression time of 35s.

To conclude: using zstandard (or any LZ codec) instead of bzip2 would decrease download sizes by around 10% and speed up the installation process 6 times. Why is bzip2 still used?
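The timings quoted above can be sanity-checked with a little arithmetic. The sketch below assumes that everything other than decompression (download, unpacking, bookkeeping) takes the same time regardless of codec, which the measurements in this thread do not confirm. The variable names are illustrative only.

```python
# Timings quoted in the report, in seconds.
total_install = 103.79      # 1min 43.79s wall clock
bzip2_decompress = 89.44    # 1min 29.44s spent in `bzip2 -d`

# Assumption: the rest of the install is unaffected by the choice of codec.
other_work = total_install - bzip2_decompress

# zstd was reported as roughly 25 times faster to decompress than bzip2.
zstd_decompress = bzip2_decompress / 25

estimated_total = other_work + zstd_decompress
speedup = total_install / estimated_total

print(f"non-decompression work: {other_work:.2f}s")
print(f"estimated zstd install: {estimated_total:.2f}s")
print(f"estimated speedup:      {speedup:.1f}x")
```

Under that assumption the estimate lands at roughly 18s and a little under 6x, consistent with the "6 times" figure. The 14s figure corresponds to the non-decompression work alone.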

sbc100 commented 1 year ago

Why is bzip2 still used?

No particular reason. As long as we can decompress that archive using a module that is part of python3.6, I think we would happily switch to a different format if there are benefits to be had.

kspalaiologos commented 1 year ago

@sbc100 Is the requirement of the codec being bundled with py3.6 so hard? Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

sbc100 commented 1 year ago

@sbc100 Is the requirement of the codec being bundled with py3.6 so hard?

It's not set in stone, but we would rather not add more system dependencies.

Would switching to some other format that is built into python still give us some of the benefits which you are after?

Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

Uploading 2 different versions of the archive is possible, but I think it would add some complexity to the upload and download process. If you would like to experiment with PRs to emscripten-releases and emsdk then we could see just how much complexity it would add. (See https://chromium.googlesource.com/emscripten-releases/+/d7a2d5b091de9ea6937bbe6513e055c1bf750e6d/src/build.py#246 and https://github.com/emscripten-core/emsdk/blob/775ba048040f663abbca9ca66e264ee795f64ef3/emsdk_manifest.json#L37-L39)

sbc100 commented 1 year ago

(BTW this is the first time I've ever heard of this zstandard thing..)

kspalaiologos commented 1 year ago

Would switching to some other format that is built into python still give us some of the benefits which you are after?

Python does support LZMA out of the box. Decompression would of course be slower than zstandard, but still around 2-3 times better than the current solution. It would also save a lot of bandwidth over bzip2.
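As a rough illustration of the "out of the box" point, the sketch below builds a tiny .tar.xz archive and extracts it again using only the standard library's tarfile module. It assumes the interpreter was built with the lzma module, which is the usual case but not guaranteed. The helper names and the demo file are made up for this example and are not taken from emsdk.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def make_tar_xz(path, files):
    """Write an xz-compressed tarball from a name -> bytes mapping."""
    with tarfile.open(path, "w:xz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def extract_tar_xz(path, dest):
    """Extract an xz-compressed tarball without any external program."""
    with tarfile.open(path, "r:xz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "demo.tar.xz"
    make_tar_xz(archive, {"install/hello.txt": b"hello from xz\n"})
    extracted = extract_tar_xz(archive, tmp / "out")
    content = (tmp / "out" / "install" / "hello.txt").read_bytes()

print(extracted, content)
```

Only the mode string changes between formats: `"r:bz2"` would read the current bzip2 archives the same way.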

sbc100 commented 1 year ago

Actually, looking at the code now, it looks like we call out to the system tar executable to extract these archives: https://github.com/emscripten-core/emsdk/blob/775ba048040f663abbca9ca66e264ee795f64ef3/emsdk.py#L510-L517

That code seems to date back to 2013: fb549cdf5e0f7cb4c7296e96fb291712df10cc62

I'm guessing that code would "just work" given a .tar.xz file? (Assuming the host system has an lzma executable that tar can use. I wonder, does the base macOS image include that?)

kspalaiologos commented 1 year ago

You don't actually need lzma installed on the system. That said, bzip2 is bundled with python and still emsdk does not make use of it, calling whatever is installed on my system instead :). tar -I zstd -xvf archive.tar.zst and tar -xJf file.pkg.tar.xz could work. GNU Tar detects the compression format automatically, so you can just swap out .bz2 for .xz and nobody running coreutils would notice.
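Automatic format detection of this kind generally works by looking at the first few bytes of the file rather than its extension. The sketch below illustrates the idea with the well-known magic numbers of four formats. It is not GNU tar's actual implementation, and `sniff_compression` is a hypothetical name.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of common compressed streams.
MAGIC = {
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
    b"\x1f\x8b": "gzip",
}

def sniff_compression(header):
    """Guess the compression format from the first bytes of a file."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown"

# Produce real headers with the standard library and classify them.
samples = {
    "bzip2": bz2.compress(b"payload"),
    "xz": lzma.compress(b"payload"),
    "gzip": gzip.compress(b"payload"),
}
for expected, blob in samples.items():
    print(expected, "->", sniff_compression(blob[:8]))
```

Because detection depends on the content, renaming the file does not change which decompressor is needed.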

sbc100 commented 1 year ago

Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system, the tar command will fail, right? At least I seem to remember folks reported tar can fail when bzip2 is missing.

I guess it depends how tar was built and what version of tar is being used.
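One conservative way to handle this uncertainty is to check up front whether both tar and the matching helper program are on PATH, and to fall back to something else (for example Python's tarfile module) when they are not. The sketch below shows only that decision logic. The suffix table and function names are hypothetical and not taken from emsdk, and some tar builds may have the codecs linked in, in which case the check is stricter than necessary.

```python
import shutil

# Hypothetical mapping from archive suffix to the helper program that a
# tar build without built-in codec support would need to spawn.
HELPERS = {
    ".tbz2": "bzip2",
    ".tar.bz2": "bzip2",
    ".tar.xz": "xz",
    ".tar.zst": "zstd",
}

def helper_for(archive_name):
    """Return the helper program name for an archive, or None if unknown."""
    for suffix, helper in HELPERS.items():
        if archive_name.endswith(suffix):
            return helper
    return None

def can_use_system_tar(archive_name, which=shutil.which):
    """Return True only if `tar` and the needed helper are both on PATH."""
    helper = helper_for(archive_name)
    if helper is None:
        return False
    return which("tar") is not None and which(helper) is not None

# Simulate a machine that has tar and bzip2 installed, but not xz.
installed = {"tar", "bzip2"}
fake_which = lambda prog: "/usr/bin/" + prog if prog in installed else None

print(can_use_system_tar("wasm-binaries.tbz2", fake_which))
print(can_use_system_tar("wasm-binaries.tar.xz", fake_which))
```

Passing `which` as a parameter keeps the check testable without touching the real PATH.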

sbc100 commented 1 year ago

Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

kspalaiologos commented 1 year ago

Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system, the tar command will fail, right? At least I seem to remember folks reported tar can fail when bzip2 is missing.

Indeed, that is right.

Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

Yes, likely, but I don't have any experience with Macs.

dschuff commented 1 year ago

It looks like Mac has supported tar.xz files since 10.10 (https://www.ctrl.blog/entry/archive-utility-xz.html). And it turns out we already use the xz archives for the version of Node we ship with emsdk on Linux, and nobody has complained. So I'd be in favor of switching given the size and decompression speed advantages.

We would probably have to do some hackery in the emsdk installer if we want it to support getting the bz2 archives for older versions of emscripten and xz for newer versions.

sbc100 commented 1 year ago

Some results from my initial attempts at switching to .xz.

So it seems like we should go for it. We could even look at speeding up compression using the -T0 flag to xz if that compression time is an issue.

I'm looking into adding the magic to emsdk now (I think we will have to have it check for both filenames).
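A minimal sketch of what "check for both filenames" could look like is shown below: prefer the xz archive and fall back to the bzip2 one for older releases. The function name, the `wasm-binaries.tar.xz` filename and the `exists` callback are assumptions for illustration. The actual emsdk change may be structured quite differently.

```python
def pick_archive(exists, base="wasm-binaries"):
    """Return the first archive name that exists, preferring .tar.xz.

    `exists` is a callable taking a filename and returning True or False,
    e.g. a wrapper around an HTTP HEAD request (not shown here).
    """
    for suffix in (".tar.xz", ".tbz2"):
        name = base + suffix
        if exists(name):
            return name
    return None

# Simulated listings for a newer and an older release.
new_release = {"wasm-binaries.tar.xz"}
old_release = {"wasm-binaries.tbz2"}

print(pick_archive(lambda n: n in new_release))
print(pick_archive(lambda n: n in old_release))
print(pick_archive(lambda n: False))
```

Trying the newer format first means that older releases cost one extra failed lookup, while newer releases never touch the bzip2 path.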

sbc100 commented 1 year ago

Yup! Passing the -T0 flag to xz gets compression time down to 16 seconds on my 56 core desktop (tar -I "xz -T0" -cf wasm-binaries2.tar.xz install/), and only sacrificed 1% on size (246M vs 242M).

dschuff commented 1 year ago

emscripten-releases side CL is landing, let's keep an eye on things. Any appetite to help our windows users too? The windows archive has always been the largest (although not just because of the compression).

sbc100 commented 1 year ago

I'm personally inclined to leave Windows alone, but mostly because I find debugging Windows issues to be a lot harder than macOS or Linux ones.

sbc100 commented 1 year ago

Closing this for now since we removed the use of bzip2