Support for `zlib` compressed archives (`gzip`/`zip`)

fsspec / kerchunk

Cloud-friendly access to archival data

https://fsspec.github.io/kerchunk/

MIT License


Support for `zlib` compressed archives (`gzip`/`zip`) #281

Open forrestfwilliams opened 1 year ago

forrestfwilliams commented 1 year ago

As mentioned here, kerchunk cannot be used with zip archives that are internally compressed, because the zlib DEFLATE compression that underlies zip archives does not support random-access reads.

However, I recently became aware of work by the zlib team that allows you to create sidecar index files for compressed archives, which enable random-access reads (see this script). Others have used this functionality to create a Python package that provides random-access reads for gzip archives (which also use the DEFLATE compression algorithm).

I think it would be great to explore whether these capabilities could be adapted and used by kerchunk to enable indexing of compressed zip and gzip archives.
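
To make the "index of seek points" idea concrete, here is a minimal sketch using only Python's standard `zlib` module. It is not how zran or indexed_gzip work: those build an index for an existing stream by saving decompressor state, which pure-Python `zlib` cannot do. Instead, this sketch swaps in a simpler technique and inserts `Z_FULL_FLUSH` points at write time, which is enough to show what a table of (uncompressed offset, compressed offset) pairs buys you. The function names and the `span` parameter are hypothetical.

```python
import zlib


def compress_with_seek_points(data: bytes, span: int = 1 << 16):
    """Compress `data` as raw DEFLATE and record a seek point every `span` bytes.

    Returns (compressed_bytes, index), where index is a list of
    (uncompressed_offset, compressed_offset) pairs.
    """
    comp = zlib.compressobj(level=6, wbits=-15)  # wbits=-15: raw DEFLATE, no header
    out = bytearray()
    index = []
    for start in range(0, len(data), span):
        index.append((start, len(out)))
        out += comp.compress(data[start:start + span])
        # Z_FULL_FLUSH byte-aligns the output and resets the history window,
        # so a fresh decompressor can start at the next byte.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), index


def read_range(compressed: bytes, index, offset: int, length: int) -> bytes:
    """Read `length` uncompressed bytes at `offset` without decoding from byte 0."""
    # Last seek point at or before the requested offset.
    uoff, coff = max(p for p in index if p[0] <= offset)
    skip = offset - uoff
    decomp = zlib.decompressobj(wbits=-15)
    # A real reader would fetch only the needed byte range; slicing the
    # whole tail keeps the sketch short.
    chunk = decomp.decompress(compressed[coff:], skip + length)
    return chunk[skip:skip + length]


data = bytes(range(256)) * 4096  # 1 MiB of sample data
blob, index = compress_with_seek_points(data)
piece = read_range(blob, index, 700_000, 100)
```

The `index` list plays the role of the sidecar file: for a kerchunk-style reference, each pair would tell the reader which compressed byte range to fetch for a given uncompressed chunk.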

martindurant commented 1 year ago

Wow, so they actually split the stream finally. This is indeed pretty big news! Now, depending on the spacing of the seek points in the index, the saved file can be pretty big, but this does give us something practical. It puts us in a similar territory as other compressors with explicit blocks, such as zstd.
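
A rough back-of-the-envelope sketch of that size trade-off, assuming each seek point stores one full DEFLATE history window (32 KiB) uncompressed plus an optional small amount of bookkeeping. The function name and the `per_point_overhead` parameter are hypothetical, and a real index format may store less, for example by compressing the windows.

```python
WINDOW = 32 * 1024  # DEFLATE history window size in bytes


def estimate_index_bytes(uncompressed_size: int, spacing: int,
                         per_point_overhead: int = 0) -> int:
    """Upper-bound estimate of index size for seek points every `spacing` bytes."""
    points = -(-uncompressed_size // spacing)  # ceiling division
    return points * (WINDOW + per_point_overhead)


GIB = 1 << 30
MIB = 1 << 20
coarse = estimate_index_bytes(GIB, MIB)       # one point per MiB
fine = estimate_index_bytes(GIB, 64 * 1024)   # one point per 64 KiB
```

Under these assumptions, a seek point every 1 MiB costs about 3% of the uncompressed size, while a seek point every 64 KiB costs about half of it.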

It would take some thought and work to be able to integrate this effectively.

forrestfwilliams commented 1 year ago

Yes, we are definitely far from any sort of implementation, but I'm glad that you're also excited about the possibilities this presents. Currently I see two incremental tasks that would be worth tackling:

1. Using a file compressed via gzip, use indexed_gzip to write an unformatted sidecar file. This will help us determine how to actually acquire the index information we'd need for kerchunk.
2. Create an indexed_zip package with the same basic functionality as indexed_gzip and repeat step 1 for a zip file.
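
As a starting point for the second task, it may help to note that a DEFLATE-compressed zip member is a raw DEFLATE stream sitting at a known byte range inside the archive. The sketch below uses only the standard library to locate that range and decode it; `member_byte_range` is a hypothetical helper, not part of any existing package, and it does not handle zip64 or encrypted members.

```python
import io
import struct
import zipfile
import zlib


def member_byte_range(zip_bytes: bytes, name: str):
    """Return (start, size, method) of a member's compressed data in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
    # The local file header is 30 fixed bytes; the filename and extra-field
    # lengths are two little-endian uint16 values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from("<HH", zip_bytes, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return start, info.compress_size, info.compress_type


# Build a small in-memory archive to try it on.
original = b"hello kerchunk " * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", original)
zip_bytes = buf.getvalue()

start, size, method = member_byte_range(zip_bytes, "a.txt")
raw = zip_bytes[start:start + size]
payload = zlib.decompress(raw, wbits=-15)  # wbits=-15: raw DEFLATE, no header
```

An index for a zip member would therefore be the same kind of seek-point table as for gzip, with compressed offsets shifted by `start`.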

I'm very interested in helping with this effort, but unfortunately I have no C or C++ experience.

forrestfwilliams commented 1 year ago

Oops, indexed_gzip already includes functionality for the first task.

forrestfwilliams commented 1 year ago

Exposing the zran functionality that indexed_gzip relies on at the module level would also be a good starting point. I've created an issue for this in the indexed_gzip repository.