VGP / vgp-tools

Alternative libraries for VGPzip #1

Open jkbonfield opened 4 years ago

jkbonfield commented 4 years ago

If you are wedded to using Deflate, don't use zlib, as it's simply ancient technology. I'd advise libdeflate (https://github.com/ebiggers/libdeflate/) instead: it's generally over twice as fast and produces compatible data streams. It also offers (at a CPU cost) higher compression levels than zlib if desired.

Better still, IMO, given this is a new proposal, is to use Zstd (https://github.com/facebook/zstd/) instead. It's a better format than Deflate, offering faster compression and decompression while generally producing smaller output. Basically it's a win-win-win.

(Better in terms of ratio is libbsc, but it has higher CPU cost, so that's definitely a tradeoff and may not be appropriate.)

For comparisons, see https://quixdb.github.io/squash-benchmark/unstable/ which shows the Pareto frontier. Obviously esoteric tools aren't appropriate, but it lets us see how the standard, well-supported tools stack up against each other. Zstd covers quite a lot of the speed vs size tradeoffs optimally.
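To make "Pareto frontier" concrete, here is a minimal Python sketch that keeps only the runs not beaten on both compression time and output size. The times and sizes are copied from the measurements posted further down this thread; the `Run` type and `pareto_frontier` helper are names made up for illustration. The runs do not all use the same thread count, so treat the result as an illustration of the idea rather than a fair benchmark.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    seconds: float  # wall-clock compression time ("real")
    size: int       # compressed size in bytes


def pareto_frontier(runs):
    """Return the runs that no other run beats on both time and size."""
    frontier = []
    for r in runs:
        dominated = any(
            o.seconds <= r.seconds and o.size <= r.size
            and (o.seconds < r.seconds or o.size < r.size)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.seconds)


# Figures taken from the timings later in this thread (thread counts differ).
runs = [
    Run("bgzip (libdeflate) -@8", 14.559, 349434708),
    Run("bgzip (zlib) -@8", 33.221, 341851007),
    Run("zstd (default level)", 22.768, 227056198),
    Run("zstd -T8 -9", 13.803, 188613044),
]

for r in pareto_frontier(runs):
    print(f"{r.name}: {r.seconds:.3f}s, {r.size} bytes")
```

With these four data points only the `zstd -T8 -9` run survives, since it is both the fastest and the smallest.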

thegenemyers commented 4 years ago

One important feature to a lot of our users is that gunzip works on the compressed file. Is that true of a file compressed with libdeflate? If I understand correctly it would not be so as the compression format is different (or is it?). -- Gene

jkbonfield commented 4 years ago

libdeflate supports both the zlib and gzip encapsulations of the Deflate format, so gunzip can read its output. In fact libdeflate even comes with a gzip executable.
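A small sketch of what "encapsulation" means here. It uses Python's standard `zlib` module as a stand-in, since libdeflate has no binding in the standard library, so it shows the container idea only and not libdeflate's API. The `deflate` helper and the sample payload are made up for illustration. The same Deflate compressor is asked for three wrappings via `wbits`: bare Deflate, the zlib wrapper, and the gzip wrapper.

```python
import gzip
import zlib

# Made-up VCF-like payload, purely for illustration.
payload = b"#CHROM\tPOS\tID\tREF\tALT\n" + b"chr1\t12345\t.\tA\tG\n" * 1000


def deflate(data, wbits):
    """Compress data with Deflate; wbits selects the container."""
    c = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return c.compress(data) + c.flush()


raw = deflate(payload, -15)           # bare Deflate stream, no header or checksum
zlib_wrapped = deflate(payload, 15)   # zlib container (RFC 1950)
gzip_wrapped = deflate(payload, 31)   # gzip container (RFC 1952)

# Any gzip-aware reader can open the gzip-wrapped stream.
assert gzip_wrapped[:2] == b"\x1f\x8b"          # gzip magic bytes
assert gzip.decompress(gzip_wrapped) == payload
assert zlib.decompress(zlib_wrapped) == payload
assert zlib.decompress(raw, wbits=-15) == payload

print(len(raw), len(zlib_wrapped), len(gzip_wrapped))
```

Any encoder that emits a valid gzip container is readable by gunzip, whichever library produced the Deflate data inside it.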

The main difference with libdeflate is that its design is block-based rather than streaming with source/sink buffers. This means it can't find LZ matches across blocks, of course, but that happens to fit our use case neatly anyway.
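To illustrate the block-based idea, here is a rough Python sketch that compresses fixed-size blocks independently, each as its own gzip member, and concatenates them. Python's `gzip` and `zlib` modules stand in for libdeflate again, and `BLOCK_SIZE`, `compress_blocks`, `compress_stream` and the sample data are all assumptions for illustration. Real BGZF additionally records each block's size in a gzip extra field, which this sketch leaves out.

```python
import gzip
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen for illustration


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each block on its own; no block can refer back to another."""
    return [
        gzip.compress(data[start:start + block_size], compresslevel=6)
        for start in range(0, len(data), block_size)
    ]


def compress_stream(data):
    """Compress everything as one gzip stream with a shared LZ window."""
    c = zlib.compressobj(level=6, wbits=31)
    return c.compress(data) + c.flush()


# Made-up VCF-like records.
data = b"".join(b"chr1\t%d\t.\tA\tG\t%d\n" % (i, i * 7 % 101) for i in range(100000))

members = compress_blocks(data)
blocked = b"".join(members)
streamed = compress_stream(data)

# The concatenated members still form a valid multi-member gzip file.
assert gzip.decompress(blocked) == data
assert gzip.decompress(streamed) == data
# Each block decodes alone, which is what allows parallel and random access.
assert gzip.decompress(members[1]) == data[BLOCK_SIZE:2 * BLOCK_SIZE]

print(len(members), "blocks:", len(blocked), "bytes blocked vs", len(streamed), "streamed")
```

Because the blocks are independent they can be compressed on separate threads, at the price of losing matches that span a block boundary.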

An example of compressing a VCF using zlib vs libdeflate and then decompressing each other's output:

$ time ./bgzip.libdeflate -@8 < /tmp/a.vcf > /tmp/a.vcf.libdeflate.gz
real    0m14.559s
user    1m40.961s
sys     0m10.068s

$ time ./bgzip.zlib -@8 < /tmp/a.vcf > /tmp/a.vcf.zlib.gz
real    0m33.221s
user    3m25.225s
sys     1m5.752s

-rw-r--r-- 1 jkb team117 18435621310 Nov 28 11:46 /tmp/a.vcf
-rw-r--r-- 1 jkb team117   349434708 Nov 28 11:49 /tmp/a.vcf.libdeflate.gz
-rw-r--r-- 1 jkb team117   341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz

$ time ./bgzip.zlib -t -@2 /tmp/a.vcf.libdeflate.gz
real    0m15.568s
user    0m32.444s
sys     0m2.804s

$ time ./bgzip.libdeflate -t -@2 /tmp/a.vcf.zlib.gz
real    0m11.812s
user    0m25.531s
sys     0m2.615s

# And for good measure, the system gzip utility vs libdeflate
$ time gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real    1m5.170s
user    1m4.730s
sys     0m0.400s

$ time ~/ftp/compression/libdeflate/gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real    0m20.680s
user    0m20.550s
sys     0m0.120s

Test machine was Ubuntu Bionic with 16x 2.6 GHz Intel Broadwell CPUs.

I've no idea why the system gzip is so much slower than bgzip linked against the system zlib. Baffling.

jkbonfield commented 4 years ago

Oh, and the benefits of not wedding ourselves to an ancient legacy format. The same file compressed with zstd at its default level, next to the default-level zlib output from above:

$ time zstd < /tmp/a.vcf > /tmp/a.vcf.zstd
real    0m22.768s
user    0m15.274s
sys     0m21.140s

$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real    0m8.803s
user    0m8.506s
sys     0m0.296s

-rw-r--r-- 1 jkb team117   341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz
-rw-r--r-- 1 jkb team117   227056198 Nov 28 12:00 /tmp/a.vcf.zstd

Or, turning up the compression level to get speed comparable to libdeflate (levels go up to 22; the default is 3):

$ time zstd -T8 -9 < /tmp/a.vcf > /tmp/a.vcf.zstd
real    0m13.803s
user    1m43.484s
sys     0m8.338s

-rw-r--r-- 1 jkb team117 188613044 Nov 28 12:05 /tmp/a.vcf.zstd

$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real    0m7.255s
user    0m6.915s
sys     0m0.329s
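For anyone wanting the sizes above as ratios, a quick Python sketch of the arithmetic. The byte counts are copied from the `ls -l` listings in this thread; the `ratio` and `saving_vs` helper names are made up for illustration.

```python
ORIGINAL = 18435621310  # /tmp/a.vcf, uncompressed

# Compressed sizes in bytes, copied from the listings in this thread.
outputs = {
    "libdeflate (bgzip -@8)": 349434708,
    "zlib (bgzip -@8)": 341851007,
    "zstd (default level)": 227056198,
    "zstd -T8 -9": 188613044,
}


def ratio(original, compressed):
    """Compression ratio: how many times smaller the output is."""
    return original / compressed


def saving_vs(baseline, compressed):
    """Percentage saved relative to a baseline size (negative means larger)."""
    return 100.0 * (1 - compressed / baseline)


zlib_size = outputs["zlib (bgzip -@8)"]
for name, size in outputs.items():
    print(f"{name:24s} ratio {ratio(ORIGINAL, size):6.1f}x  "
          f"{saving_vs(zlib_size, size):+6.1f}% vs zlib")
```

On these figures the default-level zstd file is roughly a third smaller than the zlib one, and the level 9 file is smaller again.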