Open tansy opened 2 years ago
First, note that the `silesia` corpus is not great, especially when it is compressed all as one file, as that gives very different weights to the files inside it. It's also a bit text-heavy. Out of habit, I do still compress the whole `silesia` corpus as one of my tests. But I tested libdeflate v1.9 on many other files too, many more than for previous versions of libdeflate.
I think the result you're seeing is largely the natural consequence of the algorithms currently available. Fundamentally, it is hard to fill the gap between lazy and near-optimal parsing. Previous versions of libdeflate tried to fill this gap at levels 8-9 by running the near-optimal parser with weak parameters. Your suggestion does basically the same, but does it for levels 10-11 instead. However, this does very badly on some data, worse than lazy parsing. E.g., see #85. Here are results from the file from #85 with and without your proposed patch:
Level | libdeflate v1.9-patched | libdeflate v1.9 | zlib |
---|---|---|---|
1 | 111669 / 4 | 111669 / 4 | 118729 / 6 |
2 | 106769 / 6 | 106769 / 5 | 110383 / 7 |
3 | 99898 / 3 | 99898 / 3 | 102669 / 7 |
4 | 92318 / 6 | 92318 / 6 | 97863 / 12 |
5 | 91267 / 4 | 91267 / 4 | 90502 / 13 |
6 | 81270 / 8 | 81270 / 8 | 74639 / 20 |
7 | 74637 / 13 | 74067 / 14 | 73230 / 24 |
8 | 72515 / 21 | 71515 / 30 | 71121 / 44 |
9 | 71003 / 26 | 71003 / 44 | 71129 / 48 |
10 | 88324 / 67 | 72572 / 106 | N/A |
11 | 73515 / 112 | 69899 / 218 | N/A |
12 | 68515 / 441 | 68515 / 413 | N/A |
(Cells show compressed size / time in milliseconds.)
So, with your proposed patch, level 10 would compress worse than level 6 on this file, despite being 8x as slow. libdeflate v1.8 and earlier had essentially the same issue for the same reason, but with the drop-off occurring at level 8.
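As an aside, tables in the "compressed size / time" shape above can be produced with a small harness. Here is a sketch using Python's stdlib zlib as a stand-in (libdeflate is a C library, so this is only an illustration of the methodology, not how these numbers were actually measured):

```python
import time
import zlib

def benchmark(data, levels=range(1, 10)):
    """Return (level, compressed_size, milliseconds) rows, one per level."""
    rows = []
    for lvl in levels:
        t0 = time.perf_counter()
        size = len(zlib.compress(data, lvl))
        ms = round((time.perf_counter() - t0) * 1000)
        rows.append((lvl, size, ms))
    return rows

# Print rows in the same "size / time" shape as the tables above.
for lvl, size, ms in benchmark(b"example payload " * 4096):
    print(f"{lvl} | {size} / {ms}")
```

Running this on several representative inputs, not just one corpus, is exactly how anomalies like the level-10 regression above get caught.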
Here are the results from a file containing DNA sequencing reads:
Level | libdeflate v1.9-patched | libdeflate v1.9 | zlib |
---|---|---|---|
1 | 8627220 / 383 | 8627220 / 386 | 11185034 / 477 |
2 | 7389117 / 482 | 7389117 / 481 | 10172931 / 520 |
3 | 6517790 / 530 | 6517790 / 506 | 9300280 / 815 |
4 | 6106557 / 657 | 6106557 / 681 | 7606177 / 919 |
5 | 6043494 / 611 | 6043494 / 572 | 6904720 / 1452 |
6 | 5918475 / 995 | 5918475 / 963 | 6658313 / 3130 |
7 | 5885276 / 2278 | 5870664 / 2496 | 6592924 / 5315 |
8 | 5834253 / 3626 | 5771690 / 4851 | 6448238 / 11288 |
9 | 5767417 / 5555 | 5767417 / 5682 | 6427283 / 12526 |
10 | 5925121 / 4648 | 5698886 / 7324 | N/A |
11 | 5709114 / 6983 | 5642186 / 11360 | N/A |
12 | 5575386 / 21173 | 5575386 / 21234 | N/A |
Similarly, with your proposed patch, level 10 would compress worse than level 6 on this file. In contrast, unpatched v1.9 scales pretty nicely. Note that DNA sequencing data is currently one of the most common kinds of data people use with libdeflate (AFAIK).
It may be instructive to include zlib in the comparison on `silesia` as well:
Level | libdeflate v1.9 | zlib |
---|---|---|
1 | 73505695 / 1150 | 77261079 / 2062 |
2 | 71070866 / 1381 | 75002464 / 2259 |
3 | 70170808 / 1485 | 72969695 / 2934 |
4 | 69472978 / 1577 | 71003558 / 3316 |
5 | 68170275 / 1859 | 69163650 / 4663 |
6 | 67511903 / 2257 | 68229313 / 6592 |
7 | 67141845 / 2951 | 67940580 / 8036 |
8 | 66766245 / 4722 | 67695799 / 12096 |
9 | 66715225 / 5974 | 67644637 / 15801 |
10 | 64790395 / 20745 | N/A |
11 | 64705562 / 26929 | N/A |
12 | 64689492 / 33595 | N/A |
From what I've seen, the most common use case of libdeflate is that someone is using zlib at a particular level and they want to switch to libdeflate at the same level and get a big speed improvement, while keeping the same or slightly better compression ratio. The second most common use case is that people want to use level 12.
The latest libdeflate serves both of those use cases pretty nicely, better than previous versions IMO. Note that the gap between levels 9-10 doesn't really matter for these use cases. So while not ideal, I think it is better than having the previous issues.
I think that filling this gap would require a substantially new algorithm, not just tuned parameters. "lazy2" (the algorithm now used by levels 8-9) was sort of intended to be that, but it turned out to be more of an incremental improvement over the regular "lazy".
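To make the greedy/lazy vs near-optimal distinction concrete, here is a toy sketch (my own illustration, not libdeflate's code): a greedy longest-match parse against a dynamic-programming parse that is optimal under a crude fixed-cost model (literal = 8 bits, match = 20 bits; these costs are made up). Real DEFLATE costs depend on the Huffman codes, which only exist after parsing, and that circularity is much of what makes true optimal parsing hard:

```python
def longest_match(data, pos, min_len=3, max_len=258):
    """Length of the longest earlier occurrence matching data[pos:]."""
    best = 0
    for start in range(pos):
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        best = max(best, length)
    return best if best >= min_len else 0

LIT_COST, MATCH_COST = 8, 20  # hypothetical fixed bit costs

def greedy_cost(data):
    """Cost of the parse that always takes the longest available match."""
    pos, cost = 0, 0
    while pos < len(data):
        m = longest_match(data, pos)
        if m:
            pos, cost = pos + m, cost + MATCH_COST
        else:
            pos, cost = pos + 1, cost + LIT_COST
    return cost

def optimal_cost(data):
    """Shortest-path parse: provably no worse than greedy under this model."""
    n = len(data)
    INF = float("inf")
    cost = [0] + [INF] * n
    for pos in range(n):
        if cost[pos] == INF:
            continue
        cost[pos + 1] = min(cost[pos + 1], cost[pos] + LIT_COST)  # literal
        m = longest_match(data, pos)
        for length in range(3, m + 1):  # any prefix of a match is a match
            cost[pos + length] = min(cost[pos + length], cost[pos] + MATCH_COST)
    return cost[n]
```

By construction `optimal_cost(data) <= greedy_cost(data)` for any input; a near-optimal parser like libdeflate's has to iterate on *estimated* costs instead, which is why its behavior relative to lazy parsing varies by data.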
I get the point about (current) levels 10-12 (near_optimal). I will still investigate it further to check more details. I still stand by my opinion regarding levels 6-9: even with your odd data sample it does a good job, better than the current parameters; only levels 10+ are an anomaly.
You still cannot adjust the whole program to a few exceptional cases. If this Minecraft data is "dynamic", to be compressed as packets, no one will use levels 10+, not even 7+; packet compression is meant to be fast rather than tight. And, again, if it is "static" data, compressed once and then just decompressed, users will take a closer look at the compression levels and carefully choose the one they like the most, and they will not care if there are some local anomalies - they will take the time to study it closely.
Speaking of anomalies, you can check those medical images and you will see funny things as well. Still, no one tells you to adapt to that data set.
I still stand by my opinion regarding levels 6-9. Even with your odd data sample it does a good job, better than current parameters; only levels 10+ are an anomaly.
I'm not sure what you're referring to here, given that your patch would weaken levels 7-8 and not change levels 6 and 9. Did you maybe mix up the columns? libdeflate v1.9-patched is with your patch applied, while libdeflate v1.9 is without it. Or are you talking about performance only?
You still cannot adjust whole program to some exceptional cases.
One person's exceptional case is another person's normal case :-)
Anyway, please also look at the absolute numbers and not just at how the compression levels compare to each other. There were a lot of improvements in v1.9, and libdeflate does even better compared to zlib now. On `silesia` you can see that libdeflate level 6 now compresses better than zlib level 9.
I'm not sure what you're referring to here, given that your patch would weaken levels 7-8 and not change levels 6 and 9
In simple words, yes. Like I said, levels 6 and 9 (and 12 for that matter) stay the same: level 6 because it is tweaked as the default; 9 because it's the maximum compression for the lazy/lazy2 algorithms, so there is nothing to improve; and 12, like 9, stays the same.
The only "issue" is levels 7 and 8, which are "clumped": most of the time level 8 is virtually indistinguishable from 9. That's what I meant. The situation is analogous to levels 9-12: there the "gap" is between 9 and 10 and levels 11-12 are close to each other; here the gap is between 6 and 7 and levels 8-9 are close.
Levels 7-8 have adjusted parameters, mainly `nice_match_length`, which grows exponentially (multiplied by 1.6 per level), so it gives:
lv | 6 | 7 | 8 | 9 |
---|---|---|---|---|
1.10-1 nice | 65 | 104 | 160 | 266 => 258 |
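In code, the proposed spread is just a geometric progression clipped to DEFLATE's maximum match length. A minimal sketch of the values quoted above (the 1.6 factor and the cap at 258 are from this comment; the variable names are mine, not libdeflate's):

```python
DEFLATE_MAX_MATCH = 258  # longest match length the DEFLATE format can encode

# nice_match_length values quoted above for levels 6-9, grown by ~1.6x
# per level; anything past the format limit gets clamped (266 => 258).
proposed = [65, 104, 160, 266]
clamped = [min(n, DEFLATE_MAX_MATCH) for n in proposed]
print(clamped)  # [65, 104, 160, 258]
```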
I tested it with DNA samples from various sources (specified in the paste), and for apps2 (not so text-heavy) D(8-9) = 18k/170M (0.01%), after the change 56k (0.03%). It can be seen in the apps2 corpus, where the differences in ratio between the levels go:
version | Δ 6→7 | Δ 7→8 | Δ 8→9 |
---|---|---|---|
1.09 | 0.11 | 0.07 | 0.01 |
1.10-1 | 0.10 | 0.05 | 0.03 |
(1.09 denotes the original 1.09/1.10 version, 1.10-1 stands for the adjusted one)
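The deltas in the table above are simply the differences in compression ratio between adjacent levels; a small helper showing the computation (the function name and the example ratios are mine, for illustration only, not measured values):

```python
def level_deltas(ratios):
    """Difference in compression ratio between each adjacent level pair."""
    return [round(a - b, 2) for a, b in zip(ratios, ratios[1:])]

# Hypothetical ratios for levels 6-9: a smooth spread yields smooth deltas.
print(level_deltas([32.10, 32.00, 31.95, 31.92]))  # [0.1, 0.05, 0.03]
```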
Similarly with the dna-fasta format (formatted ACGT), even though the compression ratio changes significantly between levels:
version | Δ 6→7 | Δ 7→8 | Δ 8→9 |
---|---|---|---|
1.09 | 4.28 | 3.50 | 0.26 |
1.10-1 | 4.09 | 2.37 | 1.58 |
With the dna-1245 corpus it's really hard to see, as it's another anomalous set where level 6 is the best (for lv ≤ 9). The same is true of the medical images (lukas_2d...), where level 4 is the best (for lv ≤ 9), then 7, then 9.
Again, it can only be seen in more general data sets rather than in specialized, specific examples that are "misbehaving" due to the nature of the deflate compressor (even zlib behaves similarly).
You won't be changing parameters to adapt to DICOM images just because they don't perform in a linear fashion, will you?
PS. The last part is compressing PNG images: a bunch of decompressed images (which turns out not to be very telling, as it is a more or less general set), one image, and an animation with every frame compressed separately, as it appears in APNG.
PPS. Because you said libdeflate is used to compress DNA sequences, I tried it, and it seems that it really depends on the format used (see dna-virus/cvirus.fasta) and probably, to a lesser extent, on the actual data and how they act during compression. From this test it can be concluded that:
- pizzachilli/dna.50MB behaves like more general data, with little difference between levels 8 and 9 @v1.9 (and even less difference between 10-12 @v1.9).
- dnacorpus/dnacorpus-60M.tar has a local minimum at lv6 @v1.9/v1.10-1 (and a global minimum at lv10 @v1.9, at lv11 @v1.10-1, which are more or less the same).
- For dna-virus/cvirus.2bit-50Mi there is virtually no difference between levels 8 and 9 @v1.9.
- For dna-virus/cvirus.fasta-50Mi there is very little difference between levels 8 and 9 @v1.9 (and practically no difference between lv 10-11-12 @v1.9).
- For proteins and cvirus-2bit, levels 7-9 are faster than level 6.
[corpora]:
dnacorpus pizzachilli/dna virus dna pizzachilli/protein protein
[ed.] What I also noticed is that max_search_depth = 136 for level 12 often works better than 150. This one I don't really get.
The last conclusion came to me after seeing this:
Compressor name | Compr. size | Ratio | Filename |
---|---|---|---|
memcpy | 1060424 | 100.00 | required_block_states.nbt |
xz 5.2.5 -0 | 68443 | 6.45 | required_block_states.nbt |
xz 5.2.5 -1 | 59085 | 5.57 | required_block_states.nbt |
xz 5.2.5 -2 | 50506 | 4.76 | required_block_states.nbt |
xz 5.2.5 -3 | 46855 | 4.42 | required_block_states.nbt |
xz 5.2.5 -4 | 66490 | 6.27 | required_block_states.nbt |
xz 5.2.5 -5 | 58494 | 5.52 | required_block_states.nbt |
xz 5.2.5 -6 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -7 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -8 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -9 | 47809 | 4.51 | required_block_states.nbt |
That's^ the xz result for the famous required_block_states.nbt. It experiences a drop in performance after switching from the fast algorithm to the slower, and in the majority of cases tighter and more optimal, one, just like what happens in libdeflate.
From this and the earlier study data it can be concluded that if you have such a narrow-field application, you should first check how the compression algorithm behaves with your data; otherwise you may be surprised by seemingly "illogical performance" caused by a "nonlinear" approach.
@ebiggers - may I confirm: will level 12 always compress a PNG better than lower levels? I see the method changes with the levels: Fastest -> Greedy -> Lazy -> Near_optimal. With zlib-based compressors, I have found some images that do better with a greedy method, for example. Therefore I test all levels 1-9 with zlib. Should I do the same with libdeflate? Or is this brute-force testing unnecessary? Many thanks.
level 12 will always compress a png better than lower levels?
No, it's not guaranteed.
With zlib-based compressors, I have found some images that do better with a greedy method, for example. Therefore I test all levels 1-9 with zlib. Should I do the same with libdeflate? Or is this brute-force testing unnecessary? Many thanks.
If you want to be 100% sure you get the best possible compression ratio for a given software, whether it's zlib, libdeflate, or something else, you have to try all settings. That's just the way it is, unfortunately. Of course, in the vast majority of cases just using the strongest setting gives the best result, as expected. We're just talking about rare edge cases.
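That "try all settings" advice is easy to mechanize. A sketch using Python's stdlib zlib (the same loop applies to libdeflate's levels 1-12 through its own API; the helper name is mine):

```python
import zlib

def best_zlib_level(data):
    """Compress at every level 1-9 and return (best_level, best_size)."""
    sizes = {lvl: len(zlib.compress(data, lvl)) for lvl in range(1, 10)}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]
```

For PNG specifically you would re-run such a search per image, since, as noted above, the strongest setting is not guaranteed to win on every input.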
I noticed that in the 1.9 version the compression levels "overlap"; I mean, some of them are basically the same. I took the silesia corpus* and here is the result:
Compression at levels 8 and 9, and at 11 and 12, is almost the same - a difference in ratio of 0.02% and 0.01% is hardly noticeable. Level 10 is not much different from 11 either. The difference between 10 and 12 is ~0.05%, while between 9 and 10 it is almost 1%.
First I decided to leave levels 6, 9, and 12 as they are and spread those in between by ratio, also because level 9 is now the dividing line between the `lazy`/`lazy2` and `near_optimal` algorithms. At first I thought an even spread would be good, but then I realised that a "logarithmic" spread, or something like that, would be better, as it would resemble the existing levels. So I calculated the new ratios to sit at 1/2, 3/4 (and 1) of the gap between 6 and 9 (31.85 - 31.48), as well as between 9 and 12 (31.48 - 30.52). That would be the "ideal" to aim for.
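The arithmetic described here is straightforward; a sketch of the target-ratio computation using the silesia ratios quoted above (the function name is mine):

```python
def spread_targets(lo_ratio, hi_ratio):
    """Target ratios at 1/2 and 3/4 of the way from lo_ratio down to hi_ratio."""
    gap = lo_ratio - hi_ratio
    return lo_ratio - gap / 2, lo_ratio - 3 * gap / 4

lv7, lv8 = spread_targets(31.85, 31.48)    # between levels 6 and 9
lv10, lv11 = spread_targets(31.48, 30.52)  # between levels 9 and 12
print(round(lv7, 4), round(lv8, 4))    # 31.665 31.5725
print(round(lv10, 2), round(lv11, 2))  # 31.0 30.76
```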
First I took the 9-12 level range and checked what the results were for v1.8. After some tweaking I got to the point where they were almost perfectly matched. Then with levels 6-9 it wasn't that easy, but I brought it to an acceptable point. Now the results are like this:
Deltas calculated for it are as follows:
The results are spread a bit more "evenly" and the gap between 9 and 10 is halved.
Similar results are for other data sets.
I think you should consider adjusting the compression levels to this, or to something similar.
Here are the changes to `deflate_compress.c` as a diff. I will do a pull request if that is what you are interested in.

\* I chose it as it is a diverse, non-homogeneous, relatively big corpus that resembles real-life data, IMO best for a general-purpose compressor. I tested other corpora available on the net and the results were very similar, almost the same. They include enwik, the lukas medical images, and my "own", namely app (mozilla - 64-bit executables from the silesia corpus, Google Earth 32-bit for Windows, and Firefox for Linux), png-dec (a bunch of decompressed PNG images), and html/css/js (a bunch of site styles and scripts; something that imitates HTML pages).
\*\* To produce the results I used lzbench with libdeflate-1.9 support.