conda / conda-package-handling

Create and extract conda packages of various formats
https://conda.github.io/conda-package-handling/
BSD 3-Clause "New" or "Revised" License

Consider using temporary files on create, for tighter zstd control. #171

Closed. dholth closed this issue 1 year ago.

dholth commented 1 year ago

Checklist

What is the idea?

Consider using temporary (possibly zstd -3 compressed) info- and pkg- tarballs when creating a .conda.

According to the zstd documentation, zstandard is more efficient when you can tell the compressor the total input size ahead of time. It may also avoid allocating extra memory, further improving on #167. We would write the temporary file, then compress it into the ZIP-format .conda archive, passing the total size to the compressor. We would need to do some experiments to verify that this is actually an improvement.

Unlike when unpacking, the extra time spent writing these temporary files would usually be dwarfed by the time spent running the compressor at a high compression level.
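A minimal sketch of the proposed create path, using the python-zstandard API; the function name, the .conda member naming, and the ZipFile handling here are illustrative assumptions, not code from this repository:

```python
import tarfile
import tempfile
import zipfile

import zstandard


def add_component(conda_zip: zipfile.ZipFile, member_name: str, file_list, level=19):
    """Tar file_list into a temporary file, then zstd-compress it into the
    ZIP-format .conda archive, passing the exact input size to the compressor."""
    with tempfile.TemporaryFile() as tmp:
        with tarfile.open(fileobj=tmp, mode="w") as tar:
            for path in file_list:
                tar.add(path)
        size = tmp.tell()  # the total tar size is now known exactly
        tmp.seek(0)

        # The .conda container is a stored (uncompressed) zip; the members are
        # the zstd-compressed tarballs. size= lets zstd size its buffers for
        # this input and record the content size in the frame header.
        cctx = zstandard.ZstdCompressor(level=level)
        with conda_zip.open(member_name, mode="w") as dest:
            cctx.copy_stream(tmp, dest, size=size)
```

Whether passing size= actually saves memory or changes the output is exactly what the experiments further down this thread try to answer.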

mbargull commented 1 year ago

There are at least two ways to go about it:

  1. Create temporary tarballs as you described. (Low-level compression for the temporary files makes sense, since compressing and then decompressing them is likely faster than writing them uncompressed, because of the I/O saved.)
  2. Read the input twice, i.e., do the equivalent of size=$( tar -cO ${file_list} | wc -c ) as the first pass, and avoid temp files. (Even an estimate such as "sum of all file sizes + 512 bytes (the tar header size, IIRC) per file" would be good enough. A sketch follows at the end of this comment.)

Downside of 2.: could be slower than recompressing temporary tarballs (more read I/O, especially noticeable with many small files).
Upside of 2.: no temporary files to deal with, and no error handling around them.

In most cases I expect the extra read I/O not to matter much, and avoiding tempfile handling is a plus in my book.
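A minimal sketch of the first pass for option 2, using a throwaway counting sink instead of a temp file (the class and function names are illustrative):

```python
import tarfile


class CountingSink:
    """File-like object that discards everything written to it, but counts the bytes."""

    def __init__(self):
        self.size = 0

    def write(self, data):
        self.size += len(data)
        return len(data)

    def tell(self):
        # tarfile asks for the starting offset when it opens the stream.
        return self.size


def tar_stream_size(file_list):
    """Return the exact size of the uncompressed tar stream for file_list,
    without writing any temporary file."""
    sink = CountingSink()
    with tarfile.open(fileobj=sink, mode="w") as tar:
        for path in file_list:
            tar.add(path)
    return sink.size
```

The second pass would then tar the same files again, this time straight into something like zstandard.ZstdCompressor(level=19).stream_writer(dest, size=tar_stream_size(file_list)). The declared size only holds if the files do not change between the two passes, which is the concern raised in the next comment.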

dholth commented 1 year ago

One question is how bad it is if the size hint is wrong. And does the size hint really help?

I'm not sure whether we want to run tarfile.py twice over the same files; it would be worth measuring. If the files change between the two passes, that would be bad.

On my laptop, even zstd -1 compression is slower than the disk. But it is really nice to use less disk space, and not everyone will have a fast disk.

dholth commented 1 year ago

This suite https://github.com/conda/conda-benchmarks/blob/main/benchmarks/peakmem_tests.py is an easy way to check memory, and could produce graphs like this one if we set it up: https://dholth.github.io/conda-benchmarks/#conda_install.TimeInstall.time_explicit_install?conda-package-handling=1.9.0&conda-package-handling=2.0.0a2&p-latency=0.0&p-threads=1

dholth commented 1 year ago

I wrote a test at https://gist.github.com/dholth/0a8b26ddd361ae9a2440f19ceceaaa2e

Compression takes about the same amount of time when the size is known. The compressor produces the same number of bytes, plus a 3-byte header, for this 10MB file.

If you give the compressor the wrong number of bytes, it just fails.

If the size was not known at compression time, you have to use the streaming decompression API instead of the one-shot .decompress(bytes) API, although it is possible I should be calling another "flush frame" method.
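For reference, a minimal sketch of the size-known vs. size-unknown paths with python-zstandard (the variable names are illustrative, not the gist's code):

```python
import zstandard

data = b"payload " * 1_000_000  # stand-in for the ~10MB test file

cctx = zstandard.ZstdCompressor(level=19)

# Size known: one-shot compression embeds the content size in the frame
# header, so one-shot decompression can allocate its output buffer up front.
frame_with_size = cctx.compress(data)
assert zstandard.ZstdDecompressor().decompress(frame_with_size) == data

# Size unknown: streaming compression without a size hint produces a frame
# whose header carries no content size.
compressor = cctx.compressobj()
frame_without_size = compressor.compress(data) + compressor.flush()

# One-shot decompress() raises zstandard.ZstdError on that frame unless it is
# given a max_output_size; a streaming decompressor handles it, at the cost of
# extra buffering.
decompressor = zstandard.ZstdDecompressor().decompressobj()
assert decompressor.decompress(frame_without_size) == data

# Declaring a wrong size up front (e.g. cctx.compressobj(size=len(data) - 1))
# makes the compressor error out rather than produce a bad frame.
```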

According to the scalene profiler, at level 22:

peak memory  function
267MB        oneshot
777MB        onestream
266MB        rightsize
774MB        multistream

decompression peak memory
  9.9MB one-shot decompression
128.5MB streaming decompression, size unknown
 19.3MB streaming decompression, size known
(fails) one-shot decompression, size unknown

Memory usage is a far more reasonable ~90MB to compress and ~10-20MB to decompress at levels <= 19.

dholth commented 1 year ago

Fixed in #190 using the throwaway-TarFile strategy.

dholth commented 1 year ago

See also https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile
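A SpooledTemporaryFile could slot into the temp-tarball approach so that small info- tarballs never touch the disk. A minimal sketch under that assumption (the 10 MB threshold and the function name are made up, and this is not the code from #190):

```python
import tarfile
import tempfile

import zstandard

# Hypothetical threshold: below this the temporary tarball stays in memory,
# above it SpooledTemporaryFile transparently rolls over to an on-disk file.
SPOOL_MAX = 10 * 1024 * 1024


def compress_component(file_list, dest, level=19):
    with tempfile.SpooledTemporaryFile(max_size=SPOOL_MAX) as tmp:
        with tarfile.open(fileobj=tmp, mode="w") as tar:
            for path in file_list:
                tar.add(path)
        size = tmp.tell()  # exact tar size, known before compression starts
        tmp.seek(0)
        zstandard.ZstdCompressor(level=level).copy_stream(tmp, dest, size=size)
```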