gdraheim / zziplib

ZZIPlib provides read access to ZIP archives and unpacked data. It features an additional simplified API following the standard POSIX API for file access.

Faster lookup of entries in large file-count Zip #147

Open · yairlenga opened 1 year ago

yairlenga commented 1 year ago

Looking for feedback on the following problem:

I have a large hierarchy (>1M files, each ~10K compressed) zipped into a logical "dataset". Individual files represent simulation results, semi-structured. (Side note: I have tried storing the data in a Parquet file, but performance for retrieving a subset of the data was poor.)

When reading the Zip file, the code spends a lot of time reading the central directory. When retrieving a large number of experiments (e.g., 100+), the one-time cost of the central-directory read is reasonable (amortized across all reads). However, when looking up a few experiments (or just one), the cost of reading the central directory (measured at tens of megabytes) outweighs the cost of reading the file itself, resulting in poor performance.

I did some research, and I understand that there is no "generic" solution, as the central directory must be read sequentially (variable-length entries, no "block markers", unsorted content). I am hoping for feedback/ideas on whether it is possible to build something more efficient, leveraging the "virtualization" of zziplib, to speed up processing. Basic idea (a C sketch of the index record and lookup follows the list):

  1. Take the original Zip file.
  2. Create an "alternative" directory structure that can be binary-searched (sort entries by name, make every entry fixed size, create an index).
  3. Store the "alternative" directory as an entry in the Zip file.
  4. Somehow (how?) avoid reading the "real" central directory, and binary-search inside the "alternative" directory to locate the entry information.
  5. Use the entry information to extract/inflate/... the real experiment.
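
To make steps 2 and 4 concrete, here is a minimal C sketch of a fixed-size, sorted index record and a binary-search lookup over it. The record layout (`struct idx_entry`, the 56-byte name field, `idx_lookup`) is entirely hypothetical, not anything zziplib or the ZIP format defines, and a real on-disk format would serialize the fields explicitly (fixed endianness, no struct padding) rather than rely on in-memory layout:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>   /* pread */

#define IDX_NAME_MAX 56

struct idx_entry {               /* 72 bytes, fixed size (hypothetical) */
    char     name[IDX_NAME_MAX]; /* zero-padded entry name, the sort key */
    uint64_t lfh_offset;         /* offset of the local file header in the zip */
    uint32_t csize;              /* compressed size */
    uint32_t usize;              /* uncompressed size */
};

/* Binary-search the sorted index stored at index_offset inside the zip,
 * reading one 72-byte record per probe with pread(2). Returns 0 on hit. */
static int idx_lookup(int fd, uint64_t index_offset, uint64_t n_entries,
                      const char *name, struct idx_entry *out)
{
    uint64_t lo = 0, hi = n_entries;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        struct idx_entry e;
        if (pread(fd, &e, sizeof e,
                  (off_t)(index_offset + mid * sizeof e)) != (ssize_t)sizeof e)
            return -1;                      /* short read / I/O error */
        int cmp = strncmp(name, e.name, IDX_NAME_MAX);
        if (cmp == 0) { *out = e; return 0; }
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return -1;                              /* not found */
}
```

Note the index entry itself would have to be stored uncompressed (method 0) in the zip so that `pread` can address records directly.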

The basic idea is that the file stays compatible with standard zip tools but has a "secret" path for fast access.
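
For the "somehow (how?)" in step 4, one possibility that keeps the file an ordinary zip: standard tools ignore the archive comment in the End-Of-Central-Directory (EOCD) record, so the index entry's offset and length could be stashed there and recovered with a single small read at the tail of the file. The `zzidx:<offset>:<count>` comment convention below is my own invention; the EOCD layout itself (signature `PK\5\6`, 16-bit comment length at byte offset 20, comment text after the 22-byte fixed part) is standard zip format:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the zip archive comment (malloc'ed, NUL-terminated) or NULL.
 * By our own convention it would hold e.g. "zzidx:<offset>:<count>". */
static char *read_zip_comment(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    if (end < 22)
        return NULL;                        /* too small to be a zip */
    /* EOCD is at most 22 + 65535 bytes from EOF (max comment length) */
    size_t tail = (size_t)(end < 22 + 65535 ? end : 22 + 65535);
    unsigned char *buf = malloc(tail);
    if (!buf || pread(fd, buf, tail, end - (off_t)tail) != (ssize_t)tail) {
        free(buf);
        return NULL;
    }
    /* scan backwards for the EOCD signature PK\5\6 */
    for (size_t i = tail - 22; ; i--) {
        if (buf[i] == 'P' && buf[i+1] == 'K' && buf[i+2] == 5 && buf[i+3] == 6) {
            uint16_t clen = (uint16_t)(buf[i+20] | buf[i+21] << 8);
            char *c = NULL;
            if (i + 22 + clen <= tail && (c = malloc(clen + 1u))) {
                memcpy(c, buf + i + 22, clen);
                c[clen] = 0;
            }
            free(buf);
            return c;
        }
        if (i == 0)
            break;
    }
    free(buf);
    return NULL;
}
```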

In theory, a lookup among 1M entries should need well under 200K of reads from the "alternate" directory, instead of the ~40MB that a regular "unzip" reads: a binary search over 1M sorted fixed-size entries takes about log2(10^6) ≈ 20 probes, so even at one 4KB page per probe that is roughly 80KB.

Any ideas/feedback on how to extend/leverage zziplib using ext-io to achieve the above speedup?
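
For step 5, once the index yields the local-file-header offset and sizes, the entry can be inflated directly with zlib without ever touching the real central directory; the same reads could presumably also be routed through zziplib's ext-io plugin handlers so the rest of the code keeps the zzip API. Here is a sketch under the assumption that the entry is stored with method 8 (deflate); the local-file-header layout used (signature `PK\3\4`, 30-byte fixed part, name/extra lengths at byte offsets 26 and 28) is standard zip format:

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <zlib.h>        /* link with -lz */

/* Inflate one entry, given its local-file-header offset and sizes from
 * the index, into a caller-provided buffer of usize bytes. */
static int extract_entry(int fd, uint64_t lfh_offset,
                         uint32_t csize, uint32_t usize, unsigned char *out)
{
    unsigned char hdr[30];
    if (pread(fd, hdr, 30, (off_t)lfh_offset) != 30)
        return -1;
    if (!(hdr[0] == 'P' && hdr[1] == 'K' && hdr[2] == 3 && hdr[3] == 4))
        return -1;                          /* not a local file header */
    /* data begins after the 30-byte fixed header, the file name and the
     * extra field (lengths at offsets 26 and 28, little-endian) */
    uint16_t fnlen = (uint16_t)(hdr[26] | hdr[27] << 8);
    uint16_t exlen = (uint16_t)(hdr[28] | hdr[29] << 8);
    off_t data = (off_t)(lfh_offset + 30 + fnlen + exlen);

    unsigned char *cbuf = malloc(csize);
    if (!cbuf || pread(fd, cbuf, csize, data) != (ssize_t)csize) {
        free(cbuf);
        return -1;
    }
    z_stream zs = {0};
    if (inflateInit2(&zs, -MAX_WBITS) != Z_OK) { /* raw deflate, no zlib header */
        free(cbuf);
        return -1;
    }
    zs.next_in  = cbuf; zs.avail_in  = csize;
    zs.next_out = out;  zs.avail_out = usize;
    int rc = inflate(&zs, Z_FINISH);
    inflateEnd(&zs);
    free(cbuf);
    return rc == Z_STREAM_END ? 0 : -1;
}
```

Put together, the three sketches would make one lookup cost a single tail read (`read_zip_comment`), ~20 probe reads (`idx_lookup`), and two reads plus one inflate (`extract_entry`), instead of a full central-directory scan.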