Open jkbonfield opened 4 years ago
One important feature to a lot of our users is that gunzip works on the compressed file. Is that true of a file compressed with libdeflate? If I understand correctly it would not be so as the compression format is different (or is it?). -- Gene
On 11/19/19, 11:35 AM, James Bonfield wrote:
If you are wedded to using Deflate, don't use Zlib as it's simply ancient technology. I'd advise libdeflate https://github.com/ebiggers/libdeflate/ instead as generally it's over double the performance and produces compatible data streams. It also offers (at a CPU cost) higher compression levels than zlib if desired.
However, better still IMO, given this is a new proposal, is to use Zstd https://github.com/facebook/zstd/ instead. It's a better format than Deflate, offering faster compression and decompression while generally producing smaller output. Basically it's a win-win-win.
(Better in terms of ratio is libbsc, but it has higher CPU cost, so that's definitely a tradeoff and may not be appropriate.)
For comparisons, see https://quixdb.github.io/squash-benchmark/unstable/ which shows the Pareto frontier. Obviously esoteric tools aren't appropriate, but it lets us see how the standard, well-supported tools stack up against each other. Zstd covers quite a lot of the speed vs size tradeoffs optimally.
libdeflate supports both the zlib and gzip encapsulations of the Deflate format, so standard gunzip can read its output. In fact, libdeflate even comes with its own gzip executable.
The main difference with libdeflate is that its design is block-based rather than streaming with source/sink buffers. This means it can't do LZ compression across blocks, of course, but that happens to fit our use case neatly anyway.
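To make the block-based point concrete, here is a minimal sketch in Python. It uses the standard library's gzip module rather than libdeflate itself, purely to illustrate the idea; the 64 KiB block size, the helper name compress_blocks and the sample payload are all hypothetical choices, not anything from libdeflate or bgzip. Each block is compressed on its own and the resulting gzip members are concatenated, which a standard gzip reader decodes as one stream.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen for illustration only


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Compress each block independently and concatenate the gzip members."""
    members = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Every call starts from a fresh compressor state, so no LZ match
        # can refer back to data in an earlier block.
        members.append(gzip.compress(block))
    return b"".join(members)


# Made-up, VCF-like payload; not taken from the file used in the timings.
payload = b"chr1\t12345\t.\tA\tG\t50\tPASS\t.\n" * 20000

blocked = compress_blocks(payload)
whole = gzip.compress(payload)

# A multi-member gzip stream decompresses to the concatenation of its members.
restored = gzip.decompress(blocked)
print("round trip ok:", restored == payload)
print("single stream:", len(whole), "bytes; independent blocks:", len(blocked), "bytes")
```

The size difference printed at the end is the cost of giving up matches across block boundaries plus the per-member header and trailer; in exchange, each block can be compressed or decompressed independently, which is what makes multi-threading straightforward.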
Here is an example of compressing a VCF using zlib vs libdeflate, and then each decompressing the other's output:
$ time ./bgzip.libdeflate -@8 < /tmp/a.vcf > /tmp/a.vcf.libdeflate.gz
real 0m14.559s
user 1m40.961s
sys 0m10.068s
$ time ./bgzip.zlib -@8 < /tmp/a.vcf > /tmp/a.vcf.zlib.gz
real 0m33.221s
user 3m25.225s
sys 1m5.752s
-rw-r--r-- 1 jkb team117 18435621310 Nov 28 11:46 /tmp/a.vcf
-rw-r--r-- 1 jkb team117 349434708 Nov 28 11:49 /tmp/a.vcf.libdeflate.gz
-rw-r--r-- 1 jkb team117 341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz
$ time ./bgzip.zlib -t -@2 /tmp/a.vcf.libdeflate.gz
real 0m15.568s
user 0m32.444s
sys 0m2.804s
$ time ./bgzip.libdeflate -t -@2 /tmp/a.vcf.zlib.gz
real 0m11.812s
user 0m25.531s
sys 0m2.615s
# And for good measure, the system gzip utility vs libdeflate
$ time gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real 1m5.170s
user 1m4.730s
sys 0m0.400s
$ time ~/ftp/compression/libdeflate/gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real 0m20.680s
user 0m20.550s
sys 0m0.120s
Test machine was Ubuntu Bionic with 16x 2.6GHz Intel Broadwell CPUs.
I've no idea why the system gzip is so much slower than bgzip linked against the system zlib. Baffling.
Oh, and the benefits of not wedding ourselves to an ancient legacy format. The same file with the default zstd compression level:
$ time zstd < /tmp/a.vcf > /tmp/a.vcf.zstd
real 0m22.768s
user 0m15.274s
sys 0m21.140s
$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real 0m8.803s
user 0m8.506s
sys 0m0.296s
-rw-r--r-- 1 jkb team117 341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz
-rw-r--r-- 1 jkb team117 227056198 Nov 28 12:00 /tmp/a.vcf.zstd
Or with comparable speed to libdeflate, turning up the compression level (it goes up to 22, but default is 3 I think):
$ time zstd -T8 -9 < /tmp/a.vcf > /tmp/a.vcf.zstd
real 0m13.803s
user 1m43.484s
sys 0m8.338s
-rw-r--r-- 1 jkb team117 188613044 Nov 28 12:05 /tmp/a.vcf.zstd
$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real 0m7.255s
user 0m6.915s
sys 0m0.329s
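For anyone who wants the sizes above as ratios, a small Python sketch follows. The byte counts are copied from the ls listings in this comment; the labels and the helper name ratio are just illustrative.

```python
# Sizes in bytes, copied from the ls listings above.
ORIGINAL = 18435621310
SIZES = {
    "libdeflate (bgzip -@8)": 349434708,
    "zlib (bgzip -@8)": 341851007,
    "zstd (default level)": 227056198,
    "zstd -T8 -9": 188613044,
}


def ratio(compressed: int, original: int = ORIGINAL) -> float:
    """Compression ratio: uncompressed size divided by compressed size."""
    return original / compressed


zlib_size = SIZES["zlib (bgzip -@8)"]
for name, size in SIZES.items():
    # Also show each output relative to the zlib output size.
    print(f"{name:24s} {size:>12d} bytes  ratio {ratio(size):6.1f}x  "
          f"{size / zlib_size:.2f} of zlib size")
```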