ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression
MIT License
1.01k stars 169 forks source link

zlib-gzip -9 compresses geo.protodata more than libdeflate-gzip -12 #4

Closed jrmuizel closed 8 years ago

jrmuizel commented 8 years ago

I get 15113 bytes from zlib and 15338 bytes from libdeflate on this uncompressed version of this file: https://github.com/quixdb/squash-benchmark/blob/master/geo.protodata.xz

ebiggers commented 8 years ago

How many times of files did you test? libdeflate uses an entirely different algorithm at level 8 and above; it will provide a better compression ratio than the faster algorithm (which is similar to zlib's algorithm) most of the time, but it will, inevitably, be possible to find an input where the opposite is true, if only slightly. There is some room for improvement in this area though.

jrmuizel commented 8 years ago

libdeflate did better on some of the other files that I tried. Interestingly enough, zlib-gzip -8 also beats libdeflate-gzip -8: 15113 vs 15901

ebiggers commented 8 years ago

Well, as I mentioned, libdeflate switches to a different algorithm at level 8 and above, and that is causing the difference for this particular file. At level 7, libdeflate uses an algorithm similar to zlib, so the difference disappears (and libdeflate actually does slightly better than zlib).

Apparently, this file is from the test data set checked into the repository for snappy. For some context, here's some more gzip compressed size stats from that data set (minimum size of each file bolded):

File zlib -7 libdeflate -7 zlib -9 libdeflate -9 libdeflate -12
alice29.txt 54278 54273 54191 52076 51941
asyoulik.txt 48863 48880 48829 46856 46729
baddata1.snappy 22936 23002 22936 22846 22844
baddata2.snappy 23016 23081 23016 22925 22917
baddata3.snappy 23721 23786 23721 23629 23623
fireworks.jpeg 122942 123078 122942 123046 123043
geo.protodata 15226 15077 15113 15824 15338
html 13692 13740 13589 13773 13261
html_x_4 53353 53392 52934 53938 51799
kppkn.gtb 38315 38270 37633 35027 34515
lcet10.txt 144585 144526 144429 138876 138337
paper-100k.pdf 81237 81259 81211 81099 81052
plrabn12.txt 194610 194883 194277 185762 185711
urls.10K 222217 222151 220207 217978 213738

As you can see, libdeflate's strongest compression mode usually beats zlib, sometimes even by a significant margin, but some files such as geo.protodata are outliers.

There are always going to be some outliers but I have some ideas in mind that might improve the situation.

lorents17 commented 8 years ago

I thank for your project! I wanted to ask one more question. How it is possible to improve compression?

libdeflate 7-zip 16.00 Ratoi
gxkkrnb.png 605 371 bytes 595 885 bytes -1,5%

7z -tgzip -mx=9 -mfb=258 -mpass=2

ebiggers commented 8 years ago

As with the other files that have been pointed out, a better block splitting algorithm is needed.

lorents17 commented 8 years ago

you plan to introduce this function in project?

ebiggers commented 8 years ago

I have released libdeflate v0.2, which includes an improved block splitting algorithm as well as some other improvements. As a result, libdeflate now compresses more, or is faster, than before on most files.

Here's an updated version of the table I posted previously, with "gxkkrnb.png" added:

File zlib -6 zlib -9 libdeflate -6 libdeflate -9 libdeflate -12
alice29.txt 54416 54182 54408 51753 51560
asyoulik.txt 48909 48790 48771 46654 46444
baddata1.snappy 22953 22953 22976 22789 22783
baddata2.snappy 23035 23035 23059 22869 22863
baddata3.snappy 23730 23730 23769 23569 23566
fireworks.jpeg 122835 122835 123010 122974 122974
geo.protodata 15143 14986 15237 15306 14887
html 13711 13570 13863 13667 13236
html_x_4 53379 52750 54237 53392 51474
kppkn.gtb 38763 37665 38802 34840 34324
lcet10.txt 144916 144451 144801 138014 137602
paper-100k.pdf 81324 81274 81033 80933 80799
plrabn12.txt 195273 194344 195510 185414 184492
urls.10K 222625 218920 223338 216715 212447
gxkkrnb.png 706109 701174 665418 604853 596797

For now I did not include 7-Zip or Zopfli since libdeflate is not currently intended to consistently match the compression ratio of programs that use the very slow approach of fully evaluating multiple block splits. libdeflate should usually get very close, though, while being much faster.

lorents17 commented 8 years ago

libdeflate vs ECT

enwik8

Compressor File Size Time
ect -2 35 350 269 byte 23.478s
gzip -12 35 100 553 byte 22.608s
ect -3 35 013 879 byte 30.082s
ect -4 34 967 607 byte 35.125s
lorents17 commented 8 years ago

tell me, is it possible to increase the compression ratio to the level of ect?

ebiggers commented 8 years ago

No, as I have already noted, despite the recent improvements, libdeflate is still not intended to fully match the compression ratio of Zopfli or Zopfli-derived programs. Also, if you have further questions or enhancement requests, please create a new Github issue. This one is closed.