Piezoid / pugz

Truly parallel gzip decompression
MIT License
124 stars 11 forks source link
cpp decompression gzip library parallel

pugz

Parallel decompression of gzipped text files.

Decompresses text files with a truly parallel algorithm in two passes. (paper for details)

Getting Started

A Linux system on a recent x86_64 CPU is required.

Installing

Type:

make

For maximal performance, disable assertions with:

make asserts=0

Usage

./gunzip -t 8 file.gz

Counting lines is incredibly faster, because there is no thread synchronization:

./gunzip -l -t 8 file.gz

Test

We provide a small example:

cd example
bash test.sh

Decompression speed benchmark

Speed is measured in input compressed bytes processed per second.

Threads gunzip pugz, full decompression pugz, only counting lines
1 55 MB/s 147 MB/s 145 MB/s
3 - 205 MB/s 291 MB/s
6 - 228 MB/s 515 MB/s
12 - 248 MB/s 769 MB/s
24 - 251 MB/s 1052 MB/s

Specs: dual Xeon X5675 (2x 6C/12T), 32 GB RAM NUMA, SSD

On a more recent desktop computer (i7-4770 4C/8T, 16 GB RAM, SSD):

Threads gunzip pugz, full decompression pugz, only counting lines
1 63 MB/s 172 MB/s 170 MB/s
8 - 376 MB/s 565 MB/s

Tested on a 2.7 GB compressed file, with similar results on a 24 GB file.

Script: https://github.com/Piezoid/pugz/blob/master/example/bigger_benchmark.sh

Algorithm overview

Contrary to the pigz program which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz found a way to do truly parallel decompression. In a nutshell: the compressed file is splitted into consecutive sections, processed one after the other. Sections are in turn splitted into chunks (one chunk per thread) and will be decompressed in parallel. A first pass decompresses chunks and keeps track of back-references (see e.g. our paper for the definition of that term), but is unable to resolve them. Then, a quick sequential pass is done to resolve the contexts of all chunks. A final parallel pass translates all unresolved back-references and outputs the file.

Roadmap/TODOs

This is a prototype for proof of concept, so expect some rough corners.

If pugz chokes on some of your large files that you are willing to share, please fill a issue !

License

This project is licensed under the MIT License.

Citation

@inproceedings{pugz,
author = {Kerbiriou, Mael and Chikhi, Rayan},
year = {2019},
month = {05},
pages = {209-217},
title = {Parallel Decompression of Gzip-Compressed Files and Random Access to DNA Sequences},
doi = {10.1109/IPDPSW.2019.00042},
booktitle={2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}
}

Acknowledgements

ebiggers for writing libdeflate