Closed dholth closed 1 year ago
Our conda-build update, which correctly passes level 16 from a config file instead of the (too high?) default of 22, should also reduce memory usage. I'm having trouble figuring out how much memory each level needs from the zstd manual or man pages.
A quick memory test:

```python
import io
import sys

import zstandard

compressor = zstandard.ZstdCompressor(level=int(sys.argv[1]))
writer = compressor.stream_writer(io.BytesIO())
writer.write(b"hello")
writer.close()
```
macOS's `/usr/bin/time -l` shows us the memory needs:

```
% /usr/bin/time -l python memtest.py 12
36225024  maximum resident set size
% /usr/bin/time -l python memtest.py 16
44367872  maximum resident set size
% /usr/bin/time -l python memtest.py 19
95436800  maximum resident set size
% /usr/bin/time -l python memtest.py 22
691208192  maximum resident set size
```
So we go from 36 MB (-12) and 44 MB (-16) all the way up to 691 MB (-22). The conda-build update should help: we were experiencing a bug where conda-forge wanted -16 but got the default of -22.
I think it is possible to pass the input size into `ZstdCompressor` if we use the API more smartly. This would help when packages are smaller than e.g. the 691 MB buffer seen at -22.
I would be open to changing our default to -19, which is the highest level the command-line tool lets you use without passing --ultra...
conda-package-handling 2.0's `.conda` creation code is in the style of conda-package-streaming, but is independent: https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/conda_fmt.py#L104
(It would be tricky but very nice to figure out a good "give me successive objects that can stream data into the .conda zip archive" API, all in memory, to go into conda-package-streaming)
I highly recommend going for -19 at max, not just because of the resources during compression but also for the decompression on the users' side.
For decompression you'd need some sub-10M base memory plus the window size (which is capped by the input size, i.e., it's most relevant for big packages), e.g.:

```
2**23 = 8M
2**25 = 32M
2**26 = 64M
2**27 = 128M
```
For compression the higher levels additionally use bigger search tables which is why the memory increases more. See https://github.com/facebook/zstd/blob/v1.5.2/lib/compress/clevels.h#L44-L50 for the actual (logarithmic) values used for each level and https://manpages.ubuntu.com/manpages/focal/man1/zstd.1.html for a short description of those.
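The decompression side of those figures can be sketched as simple arithmetic (the ~10M base is the assumed figure from the comment above, not a measured constant):

```python
# Rough decompression-memory estimate: base overhead plus window size.
# The 10 MiB base is an assumption taken from the discussion, not measured.
BASE = 10 * 2**20

for window_log in (23, 25, 26, 27):
    window = 2**window_log
    total_mib = (BASE + window) / 2**20
    print(f"window 2**{window_log}: ~{total_mib:.0f} MiB to decompress")
```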
You get slightly higher resource usage for -19 compared to -18 during compression (and the same low max of x+8M for decompression). I'd go for one of those: try -19 first and see if we encounter cases that demand lowering it to -18.
We're also probably going for -19 as the default on the conda-forge side, see https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/217 . (Yay, I could just copy/paste the comment from here :grin:.)
Hopefully our parallel decompression will be okay...
> Hopefully our parallel decompression will be okay...
If https://github.com/conda/conda/blob/22.11.0/conda/core/package_cache_data.py#L72 sets the max parallelism, then you're most definitely fine :). Plus, you'd need to extract a bunch of >=128M tarballs in parallel to hit peak memory usage. And only the current builds from conda-forge from the last two weeks are affected (AFAICT Anaconda's are transmuted with a custom lower level, i.e., not affected). No need to worry there.
> I think it is possible to pass the input size into `ZstdCompressor` if we use the API more smartly. This would help when packages are smaller than e.g. the 691 MB buffer seen at -22.
I suspect that cph 2.x is much more likely to encounter OOM situations than 1.x, not unlikely due to the fact (if I understand the comment above correctly) that no size hints are provided. (We saw the OOM issue quite frequently in the first few hours after 2.x was published on conda-forge.)
I did a simple test: use `cph x` to extract a package, then recreate it, using macOS's peak-memory-showing `/usr/bin/time`. This can be run on 1.x and 2.x since the CLI is the same. `create` (`c`) should be closer to what conda-build does than `transmute`. The memory usage looks consistent with cph 2.0 keeping two zstd compressors in memory at the same time (for the pkg- and info- components), though libarchive as used in 1.9 is definitely using a different version of the zstandard library.
```
/usr/bin/time -l cph c python-3.11.0-h93c2e33_0_cpython --out-folder /tmp/ python-3.11.0-h93c2e33_0_cpython.conda
```
On Linux I get about twice as much memory usage:

```
(base) conda list conda-package-handling
# packages in environment at /opt/conda:
#
# Name                    Version  Build            Channel
conda-package-handling    1.9.0    py310h5764c6d_1  conda-forge
(base) /usr/bin/time --format=%M cph c python-3.10.8-h4a9ceb5_0_cpython.conda --out-folder /tmp/ python-3.10.8-h4a9ceb5_0_cpython.conda
706156
(base) rm /tmp/python-3.10.8-h4a9ceb5_0_cpython.conda
(base) mamba install conda-package-handling\>=2 -y >/dev/null 2>&1
(base) conda list conda-package-handling
# packages in environment at /opt/conda:
#
# Name                    Version  Build         Channel
conda-package-handling    2.0.1    pyh38be061_0  conda-forge
(base) /usr/bin/time --format=%M cph c python-3.10.8-h4a9ceb5_0_cpython.conda --out-folder /tmp/ python-3.10.8-h4a9ceb5_0_cpython.conda
1394704
```
The size hints are easy to pass (see https://python-zstandard.readthedocs.io/en/latest/compressor.html?highlight=stream_writer#zstandard.ZstdCompressor.stream_writer), but I'm not currently prepared to reintroduce temporary files in order to know the sizes up front. It would be a good experiment to measure their impact. It might be easy to add them on extract in conda-package-streaming.
I wonder about the practicality of using python-zstandard though... From a first glance I could not see an exposed method on the compressor instance to free memory, and `del compressor` also didn't seem to free anything.
For a CLI invocation of `cph` or a single package build, gh-169 would be enough. But what about recipes with multiple outputs? What about the `conda-build` test suite? And most importantly: does the decompressor behave the same? Then we'd have a problem with package extraction in `conda` itself, no matter if parallel or serial...

I'd suggest that we pull the `conda-forge::conda-package-handling=2.0.1` packages until those questions are resolved.
https://github.com/conda/conda-package-handling/blob/memory/tests/memory_test.py uses the same memory with `times=1` or `=156`.
Interesting... Any idea why we saw different behavior for `cph`?
Some nuance of reference counting surely.
Do we have an excuse to use memray?
If it's any comfort, the extract part of conda-package-streaming used by this package was originally written for a system that extracts every single package in defaults in a loop, to do some analysis.
2.0.2 released
It'd still be good to have a ticket open to track the questions from https://github.com/conda/conda-package-handling/issues/167#issuecomment-1333023530 (i.e., until it's ensured that running `CondaFormat_v2.create` multiple times in the same Python process doesn't leak memory).
What happened?
`zstd` appears to be using a surprising amount of memory on Windows, even if the package is small. It shouldn't actually use more memory than the total size of the uncompressed data. It looks like it might be trying to allocate that memory anyway?