99991 / ParallelZipFile

Quick hacky parallel ZipFile implementation
MIT License
3 stars 1 forks source link

This is an mmap-based ParallelZipFile implementation since Python's ZipFile is currently (2022-01-01) not thread safe.

Gotchas:

Example

Example for reading and checking file integrity of files in a zip archive in parallel using a ThreadPool. Just copy parallelzipfile.py into your project directory and you are good to go.

import zlib
from multiprocessing.pool import ThreadPool

from parallelzipfile import ParallelZipFile as ZipFile

def do_something_with_file(info):
    """Checking file integrity."""

    data = z.read(info.filename)

    computed_crc = zlib.crc32(data)

    assert computed_crc == info.CRC

with ZipFile("example.zip") as z:
    with ThreadPool() as pool:
        pool.map(do_something_with_file, z.infolist())

Benchmark

Benchmark

This plot shows how long it takes to process a 10 MB zip archive containing files of increasing size with 1, 2, 4 or 8 threads using ZipFile or ParallelZipFile. The zip archive contains fewer files as the file size of the contained individual files grows to keep the total size of the zip archive approximately the same (header sizes not considered).

Benchmark details

Benchmarks were run on an Intel Core i5-10300H processor (4 cores) on Xubuntu 20.04. Results are the average of ten runs (median looks about the same). All data is "hot", i.e. cached in RAM.

TODO

Find out why single threaded performance is higher than multi-threaded performance for small files. The following points have been investigated so far: