dylex / zip-stream

Haskell ZIP archive streaming processing using conduit
BSD 3-Clause "New" or "Revised" License

Support 0-compressed ZIP files #12

Open nh2 opened 2 years ago

nh2 commented 2 years ago

I'd like to create ZIP files without compression: I need to provide the ZIP format, but the ZIP compression algorithm only does ~25 MB/s on modern CPUs even at the lowest level (1), and storing uncompressed is far faster.

The docs say:

It does not (ironically) support uncompressed zip files that have been created as streams, where file sizes are not known beforehand.

I don't quite understand what exactly that means; when zipping with this library, the file sizes are known to it beforehand, aren't they?

See also potentially related (?) #4.

I think it would make sense to use this issue to track the feature of the library being able to decompress its own files.

dylex commented 2 years ago

If you just want to create zip files without compression, setting the compression level to 0 (stored) should work fine. Similarly, uncompressing most level 0 zip files with unzip will work okay. The issue arises only when uncompressing stored+streamed zip files, since there is no way to know the size of an entry's data in such a file without reading the footer TOC (the central directory) at the end of the file. It's really just a limitation of the zip file format.
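To make the format limitation concrete, here is a minimal sketch that is not part of zip-stream. It inspects the fixed 30-byte ZIP local file header and reports whether a streaming reader can tell where the entry's data ends. The names `classify`, `Verdict`, and `sampleHeader` are made up for illustration, the field offsets are taken from the standard local file header layout, and ZIP64 sizes are ignored.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, testBit, (.|.))
import Data.Word (Word16, Word32, Word8)

-- Little-endian readers at a byte offset.
le16 :: B.ByteString -> Int -> Word16
le16 b o = fromIntegral (B.index b o) .|. (fromIntegral (B.index b (o + 1)) `shiftL` 8)

le32 :: B.ByteString -> Int -> Word32
le32 b o = fromIntegral (le16 b o) .|. (fromIntegral (le16 b (o + 2)) `shiftL` 16)

-- What a streaming reader can conclude from the local header alone.
data Verdict
  = NotALocalHeader
  | SizeKnown Word32       -- compressed size is recorded in the header
  | SelfTerminating        -- deflate: the compressed stream marks its own end
  | NeedsCentralDirectory  -- stored + streamed: size only at the end of the file
  | OtherMethod Word16     -- some other compression method, not examined here
  deriving (Eq, Show)

classify :: B.ByteString -> Verdict
classify h
  | B.length h < 30 || le32 h 0 /= 0x04034b50 = NotALocalHeader
  | not streamed = SizeKnown (le32 h 18)
  | method == 8  = SelfTerminating
  | method == 0  = NeedsCentralDirectory
  | otherwise    = OtherMethod method
  where
    streamed = testBit (le16 h 6) 3  -- flag bit 3: sizes follow the data
    method   = le16 h 8              -- 0 = stored, 8 = deflate

-- Hypothetical helper: build a bare header with the given flags, method, size.
put16 :: Word16 -> [Word8]
put16 w = [fromIntegral w, fromIntegral (w `shiftR` 8)]

put32 :: Word32 -> [Word8]
put32 w = [fromIntegral (w `shiftR` s) | s <- [0, 8, 16, 24]]

sampleHeader :: Word16 -> Word16 -> Word32 -> B.ByteString
sampleHeader flags method size = B.pack $ concat
  [ put32 0x04034b50, put16 20, put16 flags, put16 method
  , put16 0, put16 0        -- modification time, date
  , put32 0                 -- crc-32 (left zero in this sketch)
  , put32 size, put32 size  -- compressed and uncompressed size
  , put16 0, put16 0 ]      -- file name length, extra field length

main :: IO ()
main = do
  print (classify (sampleHeader 0x0000 0 5))  -- SizeKnown 5
  print (classify (sampleHeader 0x0008 8 0))  -- SelfTerminating
  print (classify (sampleHeader 0x0008 0 0))  -- NeedsCentralDirectory
```

Only the last case is the problematic one: stored data carries no end marker of its own, and with flag bit 3 set the header's size fields are zero.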

nh2 commented 2 years ago

@dylex So are you saying that, for zip-stream to support uncompressing level 0 streamed zip files, the Conduit would either have to buffer the entire input in RAM, or have to know that the input is a file on disk (e.g. given as a file path) with random-access reads to its end?

dylex commented 2 years ago

Right, it would have to read the end of the file before being able to extract anything. Since that is antithetical to streaming, and there are many other good libraries for accessing zip files on disk, it doesn't seem worth adding a separate interface to allow it. I would be open to improving the handling of this situation, though (to produce a better error, so that you can fall back to buffering the whole thing somehow and using a different solution). Right now it just fails the conduit, which is not great.

If you wanted to create a zip, even by streaming, where the size and crc-32 of all files are known ahead of time, you could do it in a way that would be supported for unzipping. The zip side of the library doesn't currently support this (because there's no optional crc32 field), but that would be fairly easy to add.
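For reference, a rough sketch of what "known ahead of time" would involve: one pass over the data that yields both the byte count and the CRC-32, which could then go into the local header before the data is streamed. This is not zip-stream code. `sizeAndCrc` and `step` are hypothetical names, and the bit-by-bit CRC is written for clarity rather than speed (a real implementation would more likely use zlib's crc32 or a lookup table).

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Bits (complement, shiftR, testBit, xor)
import Data.List (foldl')
import Data.Word (Word32, Word64, Word8)
import Text.Printf (printf)

-- Feed one byte into the reflected CRC-32 (polynomial 0xEDB88320), bit by bit.
step :: Word32 -> Word8 -> Word32
step crc byte = iterate shift1 (crc `xor` fromIntegral byte) !! 8
  where
    shift1 c
      | testBit c 0 = (c `shiftR` 1) `xor` 0xEDB88320
      | otherwise   = c `shiftR` 1

-- Single pass over the chunks of a file: total size and CRC-32.
sizeAndCrc :: [B.ByteString] -> (Word64, Word32)
sizeAndCrc chunks = (size, complement acc)
  where
    (size, acc) = foldl' go (0, 0xFFFFFFFF) chunks
    go (n, c) chunk =
      let n' = n + fromIntegral (B.length chunk)
          c' = B.foldl' step c chunk
      in n' `seq` c' `seq` (n', c')  -- force both so no thunks pile up

main :: IO ()
main = do
  -- The chunk boundaries do not affect the result.
  let (size, crc) = sizeAndCrc [BC.pack "1234", BC.pack "56789"]
  printf "%d %08x\n" size crc  -- 9 cbf43926
```

The cost is that each file has to be read twice (or its size and checksum obtained some other way) before its bytes are written to the archive.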