Open forrestfwilliams opened 1 year ago
Wow, so they actually split the stream, finally. This is indeed pretty big news! Depending on the spacing of the seek points in the index, the saved index file can be pretty big, but this does give us something practical. It puts us in similar territory to other compressors with explicit blocks, such as zstd.
It would take some thought and work to be able to integrate this effectively.
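To make the spacing trade-off concrete, here is a rough stdlib-only sketch. It is not how zran or `indexed_gzip` store their index (they keep a 32 KiB window per access point, which can be written to disk). Instead it snapshots the in-memory decoder state with `zlib`'s `Decompress.copy()`. The function names `build_checkpoints` and `read_at` are made up for illustration, and it assumes a single-member gzip stream that fits in memory.

```python
import gzip
import zlib


def build_checkpoints(compressed: bytes, spacing: int, step: int = 4096):
    """Decompress once, saving decoder state about every `spacing` output bytes.

    Returns a list of (uncompressed_offset, compressed_offset, decoder_state).
    Smaller `spacing` means more checkpoints (more memory) but faster seeks.
    """
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    checkpoints = [(0, 0, decoder.copy())]
    produced = 0
    for start in range(0, len(compressed), step):
        chunk = compressed[start:start + step]
        # Without max_length, decompress() consumes the whole chunk.
        produced += len(decoder.decompress(chunk))
        if produced - checkpoints[-1][0] >= spacing:
            checkpoints.append((produced, start + len(chunk), decoder.copy()))
    return checkpoints


def read_at(compressed: bytes, checkpoints, offset: int, size: int, step: int = 4096):
    """Read `size` uncompressed bytes at `offset`, resuming from the nearest checkpoint."""
    u_off, c_off, state = max(
        (c for c in checkpoints if c[0] <= offset), key=lambda c: c[0]
    )
    decoder = state.copy()  # keep the stored checkpoint reusable
    skip = offset - u_off
    out = bytearray()
    pos = c_off
    while len(out) < skip + size and pos < len(compressed):
        out += decoder.decompress(compressed[pos:pos + step])
        pos += step
    return bytes(out[skip:skip + size])
```

With something like `cps = build_checkpoints(gzip.compress(data), spacing=64 * 1024)`, a call to `read_at(compressed, cps, 500_000, 100)` only has to decompress from the closest checkpoint rather than from the start of the stream. Each checkpoint carries the decoder's sliding window, which is why tighter spacing makes the index larger.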
Yes, we are definitely far from any sort of implementation, but I'm glad that you're also excited about the possibilities this presents. Currently I see two incremental tasks that would be worth trying to tackle:

1. For `gzip`, use `indexed_gzip` to write an unformatted side-car file. This will help us determine how to actually acquire the index information we'd need for `kerchunk`.
2. Create an `indexed_zip` package with the same basic functionality as `indexed_gzip` and repeat step 1 for a zip file.

I'm very interested in helping with this effort, but unfortunately I have no C or C++ experience.
Oops, `indexed_gzip` already includes functionality for the first task.
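For reference, a minimal sketch of what that looks like, going by my reading of the `indexed_gzip` README (`IndexedGzipFile`, `build_full_index`, `export_index`, and the `index_file` argument). Treat the exact signatures as assumptions and check them against the installed version. The helper names `export_gzip_index` and `read_with_index` and the 1 MiB spacing are made up for illustration.

```python
def export_gzip_index(gz_path, index_path, spacing=1024 * 1024):
    """Build a full seek-point index for a gzip file and save it as a side-car file."""
    import indexed_gzip  # third-party package, assumed to be installed

    fobj = indexed_gzip.IndexedGzipFile(gz_path, spacing=spacing)
    try:
        fobj.build_full_index()        # one full pass over the compressed data
        fobj.export_index(index_path)  # write the seek points to disk
    finally:
        fobj.close()


def read_with_index(gz_path, index_path, offset, size):
    """Read `size` uncompressed bytes at `offset`, reusing a saved index."""
    import indexed_gzip  # third-party package, assumed to be installed

    fobj = indexed_gzip.IndexedGzipFile(gz_path, index_file=index_path)
    try:
        fobj.seek(offset)
        return fobj.read(size)
    finally:
        fobj.close()
```

The side-car file written here is in `indexed_gzip`'s own format, so the open question for `kerchunk` is still how to pull the seek points back out of it.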
Exposing the `zran` functionality that `indexed_gzip` relies on at the module level would also be a good starting point. I've created an issue for this in the `indexed_gzip` repository.
As mentioned here, `kerchunk` cannot be used with zip archives that are internally compressed, because the `zlib` DEFLATE compression that underlies zip archives does not support random access reads.

However, I recently became aware of work by the `zlib` team that allows you to create sidecar index files for compressed archives that enable random reads (see this script). This functionality has been used by others to create a Python package that enables random access reads for `gzip` archives (which also use the DEFLATE compression algorithm).

I think it would be great to explore whether these capabilities could be adapted and used by `kerchunk` to enable indexing of compressed zip and gzip archives.