There are at least two ways to go about it:

1. Write a temporary tarball, then compress it.
2. Determine the size with a first read and avoid temp files, e.g. `size=$( tar -cO ${file_list} | wc -c )` (even "sum of all file sizes + 512 bytes (i.e., tar header size IIRC) * number of files" would be good enough; see the sketch below).

Downside for 2.: could be slower than recompressing tarballs (more read I/O, especially important for many small files).
Upside for 2.: don't have to deal with temporary files and error handling around them.
In most cases I expect the slower I/O not to matter much, and avoiding tempfile handling is a plus in my book.
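A rough sketch of that size estimate (my own illustration, assuming plain regular files; long names, links, and sparse files would add more blocks):

```python
import os

def estimate_tar_size(file_list):
    """Rough estimate of the uncompressed tar stream size: one 512-byte
    header per file plus the data rounded up to 512-byte blocks, plus the
    two zero blocks that terminate the archive. Good enough as a size hint,
    not an exact figure."""
    total = 0
    for path in file_list:
        size = os.path.getsize(path)
        total += 512                    # tar header block
        total += -(-size // 512) * 512  # file data padded to 512-byte blocks
    return total + 2 * 512              # end-of-archive marker
```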
One question is how bad it is if the size hint is wrong, and whether the size hint really helps.
I'm not sure whether we want to run tarfile.py twice over the same files; it would be worth measuring. If the files change between runs, that would be bad.
On my laptop, even zstd -1 compression is slower than the disk. But it is really nice to use less disk space, and not everyone will have a fast disk.
This suite https://github.com/conda/conda-benchmarks/blob/main/benchmarks/peakmem_tests.py is an easy way to check memory, and could produce graphs like this if we set it up. https://dholth.github.io/conda-benchmarks/#conda_install.TimeInstall.time_explicit_install?conda-package-handling=1.9.0&conda-package-handling=2.0.0a2&p-latency=0.0&p-threads=1
I wrote a test at https://gist.github.com/dholth/0a8b26ddd361ae9a2440f19ceceaaa2e
Compression takes about the same amount of time when the size is known. The compressor produces the same number of bytes, plus a 3-byte header, for this 10MB file.
If you give the compressor the wrong number of bytes, it just fails.
If the size was not known at compression time, you have to use the streaming decompression API instead of the one-shot `.decompress(bytes)` API, although it is possible I should be calling another "flush frame" method.
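For reference, a minimal sketch of that behaviour using the `zstandard` bindings (a stand-alone illustration with made-up data, not the gist itself):

```python
import zstandard

data = b"x" * 10_000_000  # stand-in for the tar stream

cctx = zstandard.ZstdCompressor(level=19)

# One-shot: the content size is written into the frame header automatically.
oneshot = cctx.compress(data)

# Streaming with the size supplied up front: the header also carries the size.
cobj = cctx.compressobj(size=len(data))
rightsize = cobj.compress(data) + cobj.flush()

# Streaming without a size hint: the frame header has no content size.
cobj = cctx.compressobj()
nosize = cobj.compress(data) + cobj.flush()

dctx = zstandard.ZstdDecompressor()
assert dctx.decompress(oneshot) == data    # one-shot decompression works
assert dctx.decompress(rightsize) == data  # works: size was known
try:
    dctx.decompress(nosize)                # fails: size unknown in the header
except zstandard.ZstdError:
    pass
dobj = dctx.decompressobj()                # streaming decompression handles it
assert dobj.decompress(nosize) == data
```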
According to the scalene profiler, at level 22:

| peak memory | function |
| --- | --- |
| 267MB | oneshot |
| 777MB | onestream |
| 266MB | rightsize |
| 774MB | multistream |

Decompression peak memory:

| peak memory | scenario |
| --- | --- |
| 9.9MB | one-shot decompression |
| 128.5MB | streaming decompression, size unknown |
| 19.3MB | streaming decompression, size known |
| (fails) | one-shot decompression, size unknown |
Memory usage is a far more reasonable ~90MB to compress and ~10-20MB to decompress at levels <= 19.
Fixed in #190 using the throwaway-TarFile strategy.
What is the idea?
Consider using temporary (possibly zstd -3 compressed) `info-` and `pkg-` tarballs when creating `.conda`. According to the zstd documentation, zstandard is more efficient when you can tell the compressor the total size ahead of time. Maybe it will avoid allocating extra memory, to further improve on #167? We would write the temporary file, and then compress it into the ZIP-format `.conda` archive, giving the total size to the compressor. We would need to do some experiments to verify that this is an improvement. Unlike unpacking, the extra time spent creating these temporary files would usually be dwarfed by the time spent running the compressor at a high level.
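A hypothetical sketch of that flow (the function name, member naming, and error handling are mine, not the actual conda-package-handling code), using `tarfile`, `zipfile`, and the `zstandard` bindings:

```python
import os
import tarfile
import tempfile
import zipfile
import zstandard

def write_conda_component(zip_path, member_name, file_list, level=19):
    """Write file_list to a temporary tarball, then zstd-compress it into the
    ZIP-format .conda archive, handing the exact size to the compressor."""
    with tempfile.NamedTemporaryFile(suffix=".tar", delete=False) as tmp:
        with tarfile.open(fileobj=tmp, mode="w") as tar:
            for path in file_list:
                tar.add(path)
        tmp_path = tmp.name
    try:
        size = os.path.getsize(tmp_path)
        cctx = zstandard.ZstdCompressor(level=level)
        with zipfile.ZipFile(zip_path, "a") as conda_zip, \
                open(tmp_path, "rb") as src, \
                conda_zip.open(member_name, "w") as dest:
            # size= lets zstd size its buffers and record the content size
            # in the frame header, enabling one-shot decompression later.
            cctx.copy_stream(src, dest, size=size)
    finally:
        os.unlink(tmp_path)
```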