Closed nh2 closed 1 year ago
Thanks, I think this all makes sense. I made some purely cosmetic changes (prefer explicit strictness, record wild cards) but kept all the same logic.
Thanks!
Releasing this would be so good.. @dylex :pray:
Sorry, this was included in the 0.2.2.0 release on Hackage in November (I think); I just forgot to push the commit with the version bump.
Ah!.. Thanks a lot :+1:
Thanks for this library, it is very useful. Here comes a contribution :)
The docs say:
This is false so far.
All `ZipDataByteString`s are kept in memory until the ZIP is finished, thus OOMing servers that try to stream large ZIP archives.
The reasons for this are various space leaks in:
- the `State` monad used to track the number of written bytes; the `Control.Monad.Trans.Strict` docs say:
- `csz`.
- a `P.putWord16le` call capturing a reference to the ByteString in `dat`.
The package so far has no test suite to ensure that the "file data is never kept in memory" claim is actually true, so I added one. (More tests should be added, e.g. to property-test round-trip encoding.)
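To illustrate the `State` monad leak mentioned above, here is a minimal sketch. It is not the library's code: `countBytesLazy` and `countBytesStrict` are hypothetical names, and the sketch assumes the counter is updated once per chunk. Even in the strict `State` monad, `modify` does not force the new state value, so each pending addition can keep its chunk alive; `modify'` forces it. With optimisations enabled GHC may remove the difference by itself, so treat this as an explanation of the mechanism rather than a reproduction.

```haskell
import Control.Monad.Trans.State.Strict (execState, modify, modify')
import qualified Data.ByteString as BS
import Data.Word (Word64)

-- | Hypothetical byte counter (not the library's code).
-- 'modify' stores the new state as an unevaluated thunk; that thunk
-- references 'chunk', so the chunk stays alive until the sum is demanded.
countBytesLazy :: [BS.ByteString] -> Word64
countBytesLazy chunks =
  execState (mapM_ (\chunk -> modify (+ fromIntegral (BS.length chunk))) chunks) 0

-- | Same counter with 'modify'', which forces the new state to weak head
-- normal form at every step, so no thunk (and no chunk) is retained.
countBytesStrict :: [BS.ByteString] -> Word64
countBytesStrict chunks =
  execState (mapM_ (\chunk -> modify' (+ fromIntegral (BS.length chunk))) chunks) 0
```

Both functions return the same total; they differ only in when the additions are evaluated, and therefore in what stays reachable in the meantime.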
My commit `Fix constant-memory streaming not working for ZipDataByteString` fixes the issues above with the minimal possible changes.
For easiest review, please see the individual commits.
However, I don't think the library should keep it at that.
The correctness of a constant-memory-streaming library should not depend on some sparsely and arcanely placed `!` symbols. Instead, I think that:
`return $ do ...`), so that they can be easily strictified, versus being able to capture everything from the surrounding code -- that easily leads to bugs like this. Instead of passing functions, pass plain data types with strict fields (e.g. `StrictData`) whose memory usage and lifetime can be reasoned about.
Concretely:
Capture all variables from this block https://github.com/dylex/zip-stream/blob/735fe015a4a19b4c93b096a8761efb206b6b89f4/Codec/Archive/Zip/Conduit/Zip.hs#L171-L200
in data types such as
and replace the block mentioned by:
This guarantees absence of thunk-related space leaks.
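As a rough illustration of the idea (not the code from the commit), the sketch below contrasts a closure with a defunctionalised record. The type `EntryInfo`, its fields, and both functions are hypothetical names chosen for this example; the real change works with the variables from the linked block.

```haskell
import Data.Word (Word32, Word64)

-- | Closure style: the returned function can capture anything in scope,
-- and the captured values may still be unevaluated thunks.
endOffsetClosure :: Word64 -> Word64 -> (() -> Word64)
endOffsetClosure offset size = \() -> offset + size

-- | Defunctionalised style: a plain record with strict fields.
-- Constructing an 'EntryInfo' forces every field, so the record cannot
-- hold a thunk that references file data. (Hypothetical fields.)
data EntryInfo = EntryInfo
  { entryOffset :: !Word64
  , entrySize   :: !Word64
  , entryCrc32  :: !Word32
  , entryIsDir  :: !Bool
  } deriving (Eq, Show)

-- | The "interpreter" for the record: the only place where the stored
-- data is turned back into a result.
endOffset :: EntryInfo -> Word64
endOffset entry = entryOffset entry + entrySize entry
```

The record makes the retained data visible in the type: whatever is kept per entry is exactly the list of fields, which is not something a closure's type can tell you.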
I have included this fix as the commit `zipStream: Ensure absence of space leaks via defunctionalisation`.
This refactoring has another benefit: We can now conclude very easily how many bytes are retained per entry until the end of the archive, simply by inspecting the fields of `CentralDirectoryInfo`. You already wrote in the docs
and now we can easily reason about whether that's really true. It also makes it easier to optimise this storage, e.g. the various `Bool`s and `Word16/32`s are all stored as 64-bit values by GHC, and they could be bit-packed for additional memory savings.
Cheers! Niklas