Closed kspalaiologos closed 1 year ago
Why is bzip2 still used?
No particular reason. As long as we can decompress that archive using a module that is part of python3.6 I think we would happily switch to a different format if there are benefits to be had.
@sbc100 Is the requirement of the codec being bundled with py3.6 so hard? Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?
@sbc100 Is the requirement of the codec being bundled with py3.6 so hard?
It's not set in stone, but we would rather not add more system dependencies.
Would switching to some other format that is built into python still give us some of the benefits which you are after?
Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?
Uploading 2 different versions of the archive is possible, but I think it would add some complexity to the upload and download process. If you would like to experiment with PRs to emscripten-releases and emsdk then we could see just how much complexity it would add. (See https://chromium.googlesource.com/emscripten-releases/+/d7a2d5b091de9ea6937bbe6513e055c1bf750e6d/src/build.py#246 and https://github.com/emscripten-core/emsdk/blob/775ba048040f663abbca9ca66e264ee795f64ef3/emsdk_manifest.json#L37-L39)
(BTW this is the first time I've ever heard of this zstandard thing..)
Would switching to some other format that is built into python still give us some of the benefits which you are after?
Python does support LZMA out of the box. Decompression would of course be slower than zstandard, but still around 2-3 times better than the current solution. It would also save a lot of bandwidth over bzip2.
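For illustration, a minimal sketch of extracting a .tar.xz archive with nothing but the Python standard library. The function name extract_tar_xz is hypothetical and is not taken from emsdk; the sketch also assumes the archive comes from a trusted source, since extractall writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def extract_tar_xz(archive_path, dest_dir):
    """Extract a .tar.xz archive using only the standard library.

    Hypothetical helper for illustration; not emsdk's actual code.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:xz" selects xz/LZMA decompression through the built-in lzma module,
    # so no external xz or lzma executable is required.
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(dest)
    return dest
```

This trades the speed of a native xz binary for zero extra system dependencies.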
Actually, looking at the code now, it looks like we call out to the system tar executable to extract these archives: https://github.com/emscripten-core/emsdk/blob/775ba048040f663abbca9ca66e264ee795f64ef3/emsdk.py#L510-L517
That code seems to date back to 2013: fb549cdf5e0f7cb4c7296e96fb291712df10cc62
I'm guessing that code would "just work" given a .tar.xz file? (assuming the host system has an lzma executable that tar can use.. I wonder, does the base macOS image include that?)
You don't actually need lzma installed on the system. That said, bzip2 is bundled with Python and still emsdk does not make use of it, calling whatever is installed on my system instead :). tar -I zstd -xvf archive.tar.zst and tar -xJf file.pkg.tar.xz could work. GNU Tar detects the compression format automatically, so you can just swap out .bz2 for .xz and nobody running coreutils would notice.
Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable.. and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reported tar can fail when bzip2 is missing.
I guess it depends how tar was built and what version of tar is being used.
Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf)
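One way to hedge against a tar build that cannot find its helper executable is to fall back to Python's own tarfile module. The sketch below is an assumption about how that could look, not what emsdk does; the name untar is made up for the example.

```python
import os
import shutil
import subprocess
import tarfile


def untar(archive_path, dest_dir):
    """Try the system tar first, then fall back to Python's tarfile module.

    Hypothetical helper for illustration; not emsdk's actual code.
    Returns a short string naming the method that succeeded.
    """
    os.makedirs(dest_dir, exist_ok=True)
    tar_exe = shutil.which("tar")
    if tar_exe:
        # Plain -xf relies on tar detecting the compression format itself.
        result = subprocess.run([tar_exe, "-xf", archive_path, "-C", dest_dir])
        if result.returncode == 0:
            return "system tar"
    # "r:*" lets tarfile detect gzip, bzip2 or xz compression transparently.
    with tarfile.open(archive_path, mode="r:*") as tar:
        tar.extractall(dest_dir)
    return "tarfile"
```

The fallback is slower than a native decompressor but works wherever Python was built with the bz2 and lzma modules.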
Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable.. and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reported tar can fail when bzip2 is missing.
Indeed, that is right.
Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf)
Yes, likely, but I don't have any experience with Macs.
It looks like Mac has supported tar.xz files since 10.10 (https://www.ctrl.blog/entry/archive-utility-xz.html). And it turns out we already use the xz archives for the version of Node we ship with emsdk on Linux, and nobody has complained. So I'd be in favor of switching given the size and decompression speed advantages.
We would probably have to do some hackery in the emsdk installer if we want it to support getting the bz2 archives for older versions of emscripten and xz for newer versions.
Some results from my initial attempts at switching to .xz.
So it seems like we should go for it. We could even look at speeding up compression using the -T0 flag to xz if that compression time is an issue.
I'm looking into adding the magic to emsdk now (I think we will have to have it check for both filenames).
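A rough sketch of what checking for both filenames could look like: try the newer .tar.xz name first and fall back to the older .tbz2 name. Everything here is an assumption made for illustration (the function names, the extension order, and the use of an HTTP HEAD request); it is not the change that landed in emsdk.

```python
import urllib.error
import urllib.request


def http_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds (illustrative helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def pick_archive_url(base_url, url_exists=http_exists,
                     extensions=(".tar.xz", ".tbz2")):
    """Return the first base_url + extension that exists, or None.

    Newer releases would be found under .tar.xz, older ones under .tbz2.
    """
    for ext in extensions:
        url = base_url + ext
        if url_exists(url):
            return url
    return None
```

Passing the existence check in as a parameter keeps the selection logic testable without network access.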
Yup! Passing the -T0 flag to xz gets compression time down to 16 seconds on my 56 core desktop (tar -I "xz -T0" -cf wasm-binaries2.tar.xz install/), and only sacrificed 1% on size (246M vs 242M).
emscripten-releases side CL is landing, let's keep an eye on things. Any appetite to help our windows users too? The windows archive has always been the largest (although not just because of the compression).
I'm personally inclined to leave Windows alone, but mostly because I find debugging Windows issues to be a lot harder than macOS or Linux ones
Closing this for now since we removed the use of bzip2
Installing Emscripten for the first time on my machine takes approximately 1min 43.79s wall clock time. 1 min 29.44s out of this figure is spent in bzip2 -d decompressing the wasm-binaries.tbz2 archive, hence my question: why bzip2?
BWT codecs are not a good choice for the kind of data contained inside of the archive. I have run some tests involving better-than-bzip2 BWT codecs, such as bzip3, yielding an archive smaller by about 14%, but this is irrelevant as the total time spent in bzip3 (-dj8) is still pretty significant: 37.419s. BWT codecs tend to be symmetric either because of the SACA algorithm or the entropy coding stage. Further, they do not provide any preprocessing capabilities for executables contained within the archive.
As such, I have tested a few LZ codecs. The archive produced by zstd -9k lies between bz2 and bz3 at around 330'331'630 bytes, but it is 25 times faster to decompress than bzip2 and 9 times faster to decompress than bzip3, hence using zstandard instead of bzip2 would improve the installation time from 1min 43s to 14s.
bzip3 and zstandard are still admittedly uncommon on Linux machines, but the rather ubiquitous lzma provides an even better ratio, albeit considerably slower, which I have verified using lzma -9k and then lzma -df as 207'465'837 bytes, almost halving the distribution size (thanks to LZMA's executable code preprocessors, among others) with a decompression time of 35s.
To conclude: using zstandard (or any LZ codec) instead of bzip2 would decrease download sizes by around 10% and speed up the installation process 6 times. Why is bzip2 still used?
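For anyone wanting to reproduce this kind of comparison without extra tools, here is a small sketch that times bzip2 and xz decompression using Python's built-in bz2 and lzma modules. It is not the method used for the figures above: zstd and bzip3 are left out because they are assumed not to be available in the standard library, and the function names are made up for the example. Results on a synthetic payload will not match the real wasm-binaries archive.

```python
import bz2
import lzma
import time


def time_roundtrip(compress, decompress, payload):
    """Compress payload, then time only the decompression step.

    Returns (compressed_size_in_bytes, decompression_seconds).
    """
    blob = compress(payload)
    start = time.perf_counter()
    restored = decompress(blob)
    elapsed = time.perf_counter() - start
    if restored != payload:
        raise ValueError("round trip did not restore the original data")
    return len(blob), elapsed


def compare_codecs(payload):
    """Run the round trip for bzip2 and xz and collect the results by name."""
    codecs = {
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    return {name: time_roundtrip(c, d, payload)
            for name, (c, d) in codecs.items()}
```

Calling compare_codecs on the bytes of a real tarball (read with open(path, "rb").read()) gives a per-codec size and decompression time to compare.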