dylex / zip-stream

Haskell ZIP archive streaming processing using conduit
BSD 3-Clause "New" or "Revised" License

Fix constant-memory compression not working for `ZipDataByteString` #13

Closed nh2 closed 1 year ago

nh2 commented 1 year ago

Thanks for this library, it is very useful. Here comes a contribution :)

The docs say:

> file data is never kept in memory (beyond a single ZipDataByteString)

This is currently not the case.

All `ZipDataByteString`s are kept in memory until the ZIP archive is finished, OOMing servers that try to stream large ZIP archives.

The reason is a number of thunk-related space leaks in `Codec/Archive/Zip/Conduit/Zip.hs`.

The package so far has no test suite ensuring that the "file data is never kept in memory" claim actually holds, so I added one. (More tests should be added, e.g. property tests for round-trip encoding.)
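For illustration, one way to assert constant-memory behaviour (a rough sketch, not necessarily how the added test works; `assertBoundedResidency` and the limit are hypothetical) is to enable RTS statistics and check the maximum live heap after the pipeline has run:

    import Control.Monad (unless, when)
    import GHC.Stats (getRTSStats, getRTSStatsEnabled, max_live_bytes)

    -- Hypothetical helper: run an action and fail if the RTS reports that the
    -- maximum live heap exceeded the given limit. The test binary has to be
    -- run with `+RTS -T` so that statistics are actually collected.
    assertBoundedResidency :: Int -> IO () -> IO ()
    assertBoundedResidency limitBytes action = do
      enabled <- getRTSStatsEnabled
      unless enabled $ fail "re-run with +RTS -T to enable RTS statistics"
      action
      peak <- fromIntegral . max_live_bytes <$> getRTSStats
      when (peak > limitBytes) $ fail $
        "max live bytes " ++ show peak ++ " exceeded limit " ++ show limitBytes

Streaming a few hundred megabytes of `ZipDataByteString` entries under such a check should make a leak like this visible immediately.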

My commit "Fix constant-memory streaming not working for ZipDataByteString" fixes the issues above with the smallest possible changes.

For easiest review, please see the individual commits.


However, I don't think the library should leave it at that.

The correctness of a constant-memory streaming library should not depend on a few sparsely and arcanely placed `!` symbols. Instead, I think the per-entry state that must be retained should be captured in explicit, strict data types.

Concretely:

Capture all variables from this block https://github.com/dylex/zip-stream/blob/735fe015a4a19b4c93b096a8761efb206b6b89f4/Codec/Archive/Zip/Conduit/Zip.hs#L171-L200

in data types such as:

{-# LANGUAGE StrictData #-}
-- ...

data CommonFileHeaderInfo = CommonFileHeaderInfo
  { cfhi_isStreamingEntry :: Bool
  , cfhi_hasUtf8Filename :: Bool
  , cfhi_isCompressed :: Bool
  , cfhi_time :: Word16
  , cfhi_date :: Word16
  } deriving (Eq, Ord, Show)

data CentralDirectoryInfo = CentralDirectoryInfo
  { cdi_off :: Word64
  , cdi_z64 :: Bool
  , cdi_commonFileHeaderInfo :: CommonFileHeaderInfo
  , cdi_crc :: Word32
  , cdi_usz :: Word64
  , cdi_name :: BSC.ByteString
  , cdi_csz :: Word64
  , cdi_zipEntryExternalAttributes :: Maybe Word32 -- the lazy `Maybe` contents must be forced, e.g. via `force`, at creation
  } deriving (Eq, Ord, Show)

putCommonFileHeaderPart :: CommonFileHeaderInfo -> P.PutM ()
-- ...
putCentralDirectory :: CentralDirectoryInfo -> P.PutM ()
-- ...

and replace the block mentioned by:

    let !centralDirectoryInfo = CentralDirectoryInfo
          { cdi_off = off
          , cdi_z64 = z64
          , cdi_commonFileHeaderInfo = commonFileHeaderInfo
          , cdi_crc = crc
          , cdi_usz = usz
          , cdi_name = name
          , cdi_csz = csz
          , cdi_zipEntryExternalAttributes = force zipEntryExternalAttributes
          }
    return $ putCentralDirectory centralDirectoryInfo

This guarantees absence of thunk-related space leaks.
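To spell out why, here is a minimal sketch (hypothetical `EntryInfo` type, not code from this package) of the two mechanisms involved: a strict field is evaluated when the record is constructed, so no thunk referencing the original `ByteString` survives, and a strict `Maybe` field is only forced to its outer constructor, which is why its contents additionally need `force`:

    {-# LANGUAGE StrictData #-}
    import Control.DeepSeq (force)
    import qualified Data.ByteString as BS
    import Data.Word (Word32)

    -- Hypothetical record, not code from this package.
    data EntryInfo = EntryInfo
      { eiLength :: Int          -- strict: evaluated when the record is built
      , eiAttrs  :: Maybe Word32 -- strict, but only to WHNF: a `Just <thunk>`
                                 -- can still retain whatever the thunk references
      }

    mkEntryInfo :: BS.ByteString -> Maybe Word32 -> EntryInfo
    mkEntryInfo bs attrs = EntryInfo
      { eiLength = BS.length bs -- computed now, so `bs` itself is not retained
      , eiAttrs  = force attrs  -- deepseq's `force` also evaluates inside the Just
      }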

I have included this fix as the commit "zipStream: Ensure absence of space leaks via defunctionalisation".

This refactoring has another benefit: we can now determine very easily how many bytes are retained per entry until the end of the archive, simply by inspecting the fields of `CentralDirectoryInfo`. You already wrote in the docs:

> consuming an additional ~100 bytes of state per entry during streaming

and now we can easily reason about whether that is really true. It also makes it easier to optimise this storage: e.g. the various `Bool` and `Word16`/`Word32` fields are each stored as full 64-bit words by GHC, and they could be bit-packed for additional memory savings.
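As a sketch of what such bit-packing could look like (hypothetical `PackedCFHI` type, not something this PR does): the three flags plus the 16-bit time and date of `CommonFileHeaderInfo` fit into a single `Word64`:

    import Data.Bits (shiftL, shiftR, testBit, (.|.))
    import Data.Word (Word16, Word64)

    -- Hypothetical packed alternative to CommonFileHeaderInfo:
    -- bits 0 to 2 hold the three flags, bits 16 to 31 the time, bits 32 to 47 the date.
    newtype PackedCFHI = PackedCFHI Word64
      deriving (Eq, Ord, Show)

    packCFHI :: Bool -> Bool -> Bool -> Word16 -> Word16 -> PackedCFHI
    packCFHI isStreamingEntry hasUtf8Filename isCompressed time date = PackedCFHI $
          (if isStreamingEntry then 1 else 0)
      .|. (if hasUtf8Filename then 2 else 0)
      .|. (if isCompressed    then 4 else 0)
      .|. (fromIntegral time `shiftL` 16)
      .|. (fromIntegral date `shiftL` 32)

    cfhiIsStreamingEntry :: PackedCFHI -> Bool
    cfhiIsStreamingEntry (PackedCFHI w) = testBit w 0

    cfhiTime, cfhiDate :: PackedCFHI -> Word16
    cfhiTime (PackedCFHI w) = fromIntegral (w `shiftR` 16)
    cfhiDate (PackedCFHI w) = fromIntegral (w `shiftR` 32)

That would shrink several machine words of per-entry state down to one, at the cost of less readable code, so it is probably only worth doing if profiling shows the per-entry state actually matters.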

Cheers! Niklas

dylex commented 1 year ago

Thanks, I think this all makes sense. I made some purely cosmetic changes (prefer explicit strictness, record wild cards) but kept all the same logic.
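(Roughly this kind of difference, as an illustration rather than the actual diff: explicit `!` bangs on each field instead of the module-wide `StrictData` pragma, and `RecordWildCards` when consuming the records.)

    {-# LANGUAGE RecordWildCards #-}
    import Data.Word (Word16)

    -- Explicit per-field bangs instead of a module-wide StrictData pragma:
    data CommonFileHeaderInfo = CommonFileHeaderInfo
      { cfhi_isStreamingEntry :: !Bool
      , cfhi_hasUtf8Filename  :: !Bool
      , cfhi_isCompressed     :: !Bool
      , cfhi_time             :: !Word16
      , cfhi_date             :: !Word16
      } deriving (Eq, Ord, Show)

    -- RecordWildCards brings all fields of a value into scope in one pattern:
    describe :: CommonFileHeaderInfo -> String
    describe CommonFileHeaderInfo{..} =
      (if cfhi_isCompressed then "compressed" else "stored")
        ++ ", time=" ++ show cfhi_time ++ ", date=" ++ show cfhi_date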

nh2 commented 1 year ago

Thanks!

ulidtko commented 1 year ago

Releasing this would be so good.. @dylex :pray:

dylex commented 1 year ago

Sorry, this was included in the 0.2.2.0 release on Hackage in November (I think); I just forgot to push the commit with the version bump.

ulidtko commented 1 year ago

Ah!.. Thanks a lot :+1: