Open tansy opened 2 years ago
First, note that the `silesia` corpus is not great, especially when it is compressed all as one file, as that gives very different weights to the files inside it. It's also a bit text-heavy. Out of habit, I do still compress the whole `silesia` corpus as one of my tests. But I tested libdeflate v1.9 on many other files too, many more than for previous versions of libdeflate.
I think the result you're seeing is largely the natural consequence of the algorithms currently available. Fundamentally, it is hard to fill the gap between lazy and near-optimal parsing. Previous versions of libdeflate tried to fill this gap at levels 8-9 by running the near-optimal parser with weak parameters. Your suggestion does basically the same, but does it for levels 10-11 instead. However, this does very badly on some data, worse than lazy parsing. E.g., see #85. Here are results from the file from #85 with and without your proposed patch:
Level | libdeflate v1.9-patched | libdeflate v1.9 | zlib |
---|---|---|---|
1 | 111669 / 4 | 111669 / 4 | 118729 / 6 |
2 | 106769 / 6 | 106769 / 5 | 110383 / 7 |
3 | 99898 / 3 | 99898 / 3 | 102669 / 7 |
4 | 92318 / 6 | 92318 / 6 | 97863 / 12 |
5 | 91267 / 4 | 91267 / 4 | 90502 / 13 |
6 | 81270 / 8 | 81270 / 8 | 74639 / 20 |
7 | 74637 / 13 | 74067 / 14 | 73230 / 24 |
8 | 72515 / 21 | 71515 / 30 | 71121 / 44 |
9 | 71003 / 26 | 71003 / 44 | 71129 / 48 |
10 | 88324 / 67 | 72572 / 106 | N/A |
11 | 73515 / 112 | 69899 / 218 | N/A |
12 | 68515 / 441 | 68515 / 413 | N/A |
(Cells show compressed size / time in milliseconds.)
So, with your proposed patch, level 10 would compress worse than level 6 on this file, despite being 8x as slow. libdeflate v1.8 and earlier had essentially the same issue for the same reason, but with the drop-off occurring at level 8.
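As an aside, tables in the "compressed size / time" shape above can be produced with a small harness. Here is a sketch using Python's stdlib zlib as a stand-in (libdeflate is a C library, so this is only an illustration of the methodology, not how these numbers were actually measured):

```python
import time
import zlib

def benchmark(data, levels=range(1, 10)):
    """Return (level, compressed_size, milliseconds) rows, one per level."""
    rows = []
    for lvl in levels:
        t0 = time.perf_counter()
        size = len(zlib.compress(data, lvl))
        ms = round((time.perf_counter() - t0) * 1000)
        rows.append((lvl, size, ms))
    return rows

# Print rows in the same "size / time" shape as the tables above.
for lvl, size, ms in benchmark(b"example payload " * 4096):
    print(f"{lvl} | {size} / {ms}")
```

Running this on several representative inputs, not just one corpus, is exactly how anomalies like the level-10 regression above get caught.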
Here are the results from a file containing DNA sequencing reads:
Level | libdeflate v1.9-patched | libdeflate v1.9 | zlib |
---|---|---|---|
1 | 8627220 / 383 | 8627220 / 386 | 11185034 / 477 |
2 | 7389117 / 482 | 7389117 / 481 | 10172931 / 520 |
3 | 6517790 / 530 | 6517790 / 506 | 9300280 / 815 |
4 | 6106557 / 657 | 6106557 / 681 | 7606177 / 919 |
5 | 6043494 / 611 | 6043494 / 572 | 6904720 / 1452 |
6 | 5918475 / 995 | 5918475 / 963 | 6658313 / 3130 |
7 | 5885276 / 2278 | 5870664 / 2496 | 6592924 / 5315 |
8 | 5834253 / 3626 | 5771690 / 4851 | 6448238 / 11288 |
9 | 5767417 / 5555 | 5767417 / 5682 | 6427283 / 12526 |
10 | 5925121 / 4648 | 5698886 / 7324 | N/A |
11 | 5709114 / 6983 | 5642186 / 11360 | N/A |
12 | 5575386 / 21173 | 5575386 / 21234 | N/A |
Similarly, with your proposed patch, level 10 would compress worse than level 6 on this file. In contrast, unpatched v1.9 scales pretty nicely. Note that DNA sequencing data is currently one of the most common kinds of data people use with libdeflate (AFAIK).
It may be instructive to include zlib in the comparison on `silesia` as well:
Level | libdeflate v1.9 | zlib |
---|---|---|
1 | 73505695 / 1150 | 77261079 / 2062 |
2 | 71070866 / 1381 | 75002464 / 2259 |
3 | 70170808 / 1485 | 72969695 / 2934 |
4 | 69472978 / 1577 | 71003558 / 3316 |
5 | 68170275 / 1859 | 69163650 / 4663 |
6 | 67511903 / 2257 | 68229313 / 6592 |
7 | 67141845 / 2951 | 67940580 / 8036 |
8 | 66766245 / 4722 | 67695799 / 12096 |
9 | 66715225 / 5974 | 67644637 / 15801 |
10 | 64790395 / 20745 | N/A |
11 | 64705562 / 26929 | N/A |
12 | 64689492 / 33595 | N/A |
From what I've seen, the most common use case of libdeflate is that someone is using zlib at a particular level and they want to switch to libdeflate at the same level and get a big speed improvement, while keeping the same or slightly better compression ratio. The second most common use case is that people want to use level 12.
The latest libdeflate serves both of those use cases pretty nicely, better than previous versions IMO. Note that the gap between levels 9-10 doesn't really matter for these use cases. So while not ideal, I think it is better than having the previous issues.
I think that filling this gap would require a substantially new algorithm, not just tuned parameters. "lazy2" (the algorithm now used by levels 8-9) was sort of intended to be that, but it turned out to be more of an incremental improvement over the regular "lazy".
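To make the greedy/lazy vs near-optimal distinction concrete, here is a toy sketch (my own illustration, not libdeflate's code): a greedy longest-match parse against a dynamic-programming parse that is optimal under a crude fixed-cost model (literal = 8 bits, match = 20 bits; these costs are made up). Real DEFLATE costs depend on the Huffman codes, which only exist after parsing, and that circularity is much of what makes true optimal parsing hard:

```python
def longest_match(data, pos, min_len=3, max_len=258):
    """Length of the longest earlier occurrence matching data[pos:]."""
    best = 0
    for start in range(pos):
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        best = max(best, length)
    return best if best >= min_len else 0

LIT_COST, MATCH_COST = 8, 20  # hypothetical fixed bit costs

def greedy_cost(data):
    """Cost of the parse that always takes the longest available match."""
    pos, cost = 0, 0
    while pos < len(data):
        m = longest_match(data, pos)
        if m:
            pos, cost = pos + m, cost + MATCH_COST
        else:
            pos, cost = pos + 1, cost + LIT_COST
    return cost

def optimal_cost(data):
    """Shortest-path parse: provably no worse than greedy under this model."""
    n = len(data)
    INF = float("inf")
    cost = [0] + [INF] * n
    for pos in range(n):
        if cost[pos] == INF:
            continue
        cost[pos + 1] = min(cost[pos + 1], cost[pos] + LIT_COST)  # literal
        m = longest_match(data, pos)
        for length in range(3, m + 1):  # any prefix of a match is a match
            cost[pos + length] = min(cost[pos + length], cost[pos] + MATCH_COST)
    return cost[n]
```

By construction `optimal_cost(data) <= greedy_cost(data)` for any input; a near-optimal parser like libdeflate's has to iterate on *estimated* costs instead, which is why its behavior relative to lazy parsing varies by data.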
I get the point about (current) levels 10-12 (near_optimal). I will still investigate it further to check more details. I still stand by my opinion regarding levels 6-9: even with your odd data sample it does a good job, better than the current parameters; only levels 10+ are an anomaly.
You still cannot adjust the whole program to a few exceptional cases. If this Minecraft data is "dynamic", to be compressed as packets, no one will use levels 10+, not even 7+; packet compression is meant to be fast rather than tight. And, again, if it is "static" data, compressed once and then just decompressed, users will take a closer look at the compression levels and carefully choose the one they like the most, and they will not care if there are some local anomalies - they will take the time to study it closely.
Speaking of anomalies, you can check those medical images and you will see funny things as well. Still, no one tells you to adapt to that data set.
I still stand by my opinion regarding levels 6-9. Even with your odd data sample it does a good job, better than current parameters; only levels 10+ are an anomaly.
I'm not sure what you're referring to here, given that your patch would weaken levels 7-8 and not change levels 6 and 9. Did you maybe mix up the columns? libdeflate v1.9-patched is with your patch applied, while libdeflate v1.9 is without it. Or are you talking about performance only?
You still cannot adjust whole program to some exceptional cases.
One person's exceptional case is another person's normal case :-)
Anyway, please also look at the absolute numbers and not just at how the compression levels compare to each other. There were a lot of improvements in v1.9, and libdeflate does even better compared to zlib now. On `silesia` you can see that libdeflate level 6 now compresses better than zlib level 9.
I'm not sure what you're referring to here, given that your patch would weaken levels 7-8 and not change levels 6 and 9
In simple words, yes. Like I said, levels 6 and 9 (and 12 for that matter) stay the same: level 6 because it is tweaked as the default; 9 because it's the maximum compression for the lazy/lazy2 algorithms, so there is nothing to improve; and 12, like 9, stays the same.
The only "issue" is levels 7 and 8, which are "clumped": most of the time level 8 is virtually indistinguishable from 9. That's what I meant. The situation is analogous to levels 9-12: there the "gap" is between 9 and 10 and levels 11-12 are close to each other; here the gap is between 6 and 7 and levels 8-9 are close.
Levels 7-8 have adjusted parameters, mainly `nice_match_length`, which grows exponentially (multiplied by 1.6 per level), so it gives:
lv | 6 | 7 | 8 | 9 |
---|---|---|---|---|
1.10-1 nice | 65 | 104 | 160 | 266 => 258 |
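In code, the proposed spread is just a geometric progression clipped to DEFLATE's maximum match length. A minimal sketch of the values quoted above (the 1.6 factor and the cap at 258 are from this comment; the variable names are mine, not libdeflate's):

```python
DEFLATE_MAX_MATCH = 258  # longest match length the DEFLATE format can encode

# nice_match_length values quoted above for levels 6-9, grown by ~1.6x
# per level; anything past the format limit gets clamped (266 => 258).
proposed = [65, 104, 160, 266]
clamped = [min(n, DEFLATE_MAX_MATCH) for n in proposed]
print(clamped)  # [65, 104, 160, 258]
```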
I tested it with DNA samples from various sources (specified in the paste), and for apps2 (not so text-heavy) D(8-9) = 18k/170M (0.01%), after the change 56k (0.03%). It can be seen in the apps2 corpus, where the differences in ratio between the levels go:
version | Δ 6→7 | Δ 7→8 | Δ 8→9 |
---|---|---|---|
1.09 | 0.11 | 0.07 | 0.01 |
1.10-1 | 0.10 | 0.05 | 0.03 |
(1.09 denotes the original 1.09/1.10 version, 1.10-1 stands for the adjusted one)
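The deltas in the table above are simply the differences in compression ratio between adjacent levels; a small helper showing the computation (the function name and the example ratios are mine, for illustration only, not measured values):

```python
def level_deltas(ratios):
    """Difference in compression ratio between each adjacent level pair."""
    return [round(a - b, 2) for a, b in zip(ratios, ratios[1:])]

# Hypothetical ratios for levels 6-9: a smooth spread yields smooth deltas.
print(level_deltas([32.10, 32.00, 31.95, 31.92]))  # [0.1, 0.05, 0.03]
```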
Similarly with the dna-fasta format (formatted ACGT), even though the compression ratio changes significantly between levels:
version | Δ 6→7 | Δ 7→8 | Δ 8→9 |
---|---|---|---|
1.09 | 4.28 | 3.50 | 0.26 |
1.10-1 | 4.09 | 2.37 | 1.58 |
With the dna-1245 corpus it's really hard to see, as it's another anomalous set where level 6 is the best (for lv ≤ 9). The same is true of the medical images (lukas_2d...), where level 4 is the best (for lv ≤ 9), then 7, then 9.
Again, it can only be seen in more general data sets rather than in specialized, specific examples that are "misbehaving" due to the nature of the deflate compressor (even zlib behaves similarly).
You won't be changing parameters to adapt to DICOM images just because they don't perform in a linear fashion, will you?
PS. The last part is compressing PNG images: a bunch of decompressed images (which turns out not to be very telling, as it is a more or less general set), one image, and an animation with every frame compressed separately, as it appears in APNG.
PPS. Because you said libdeflate is used to compress DNA sequences, I tried it, and it seems that it really depends on the format used (see dna-virus/cvirus.fasta) and probably, to a lesser extent, on the actual data and how they act during compression. From this test it can be concluded that:
- pizzachilli/dna.50MB behaves like more general data, with little difference between levels 8 and 9 @v1.9 (and even less difference between 10-12 @v1.9).
- dnacorpus/dnacorpus-60M.tar has a local minimum at lv6 @v1.9/v1.10-1 (and a global minimum at lv10 @v1.9, at lv11 @v1.10-1, which are more or less the same).
- For dna-virus/cvirus.2bit-50Mi there is virtually no difference between levels 8 and 9 @v1.9.
- For dna-virus/cvirus.fasta-50Mi there is very little difference between levels 8 and 9 @v1.9 (and practically no difference between lv 10-11-12 @v1.9).
- For proteins and cvirus-2bit, levels 7-9 are faster than level 6.
[corpora]:
dnacorpus pizzachilli/dna virus dna pizzachilli/protein protein
[ed.] What I also noticed is that max_search_depth = 136 for level 12 often works better than 150. This one I don't really get.
The last conclusion came to me after seeing this:
Compressor name | Compr. size | Ratio | Filename |
---|---|---|---|
memcpy | 1060424 | 100.00 | required_block_states.nbt |
xz 5.2.5 -0 | 68443 | 6.45 | required_block_states.nbt |
xz 5.2.5 -1 | 59085 | 5.57 | required_block_states.nbt |
xz 5.2.5 -2 | 50506 | 4.76 | required_block_states.nbt |
xz 5.2.5 -3 | 46855 | 4.42 | required_block_states.nbt |
xz 5.2.5 -4 | 66490 | 6.27 | required_block_states.nbt |
xz 5.2.5 -5 | 58494 | 5.52 | required_block_states.nbt |
xz 5.2.5 -6 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -7 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -8 | 47809 | 4.51 | required_block_states.nbt |
xz 5.2.5 -9 | 47809 | 4.51 | required_block_states.nbt |
That's^ the xz result for the famous required_block_states.nbt. It experiences a drop in performance after switching from the fast algorithm to the slower, and in the majority of cases tighter and more optimal, one, just like what happens in libdeflate.
From this and the earlier study data it can be concluded that if you have such a narrow-field application, you should first check how the compression algorithm behaves with your data; otherwise you may be surprised by seemingly "illogical performance" caused by a "nonlinear" approach.
@ebiggers - may I confirm: will level 12 always compress a PNG better than lower levels? I see the method changes with the levels: Fastest -> Greedy -> Lazy -> Near_optimal. With zlib-based compressors, I have found some images that do better with a greedy method, for example. Therefore I test all levels 1-9 with zlib. Should I do the same with libdeflate? Or is this brute-force testing unnecessary? Many thanks.
level 12 will always compress a png better than lower levels?
No, it's not guaranteed.
With zlib-based compressors, I have found some images that do better with a greedy method, for example. Therefore I test all levels 1-9 with zlib. Should I do the same with libdeflate? Or is this brute-force testing unnecessary? Many thanks.
If you want to be 100% sure you get the best possible compression ratio for a given software, whether it's zlib, libdeflate, or something else, you have to try all settings. That's just the way it is, unfortunately. Of course, in the vast majority of cases just using the strongest setting gives the best result, as expected. We're just talking about rare edge cases.
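That "try all settings" advice is easy to mechanize. A sketch using Python's stdlib zlib (the same loop applies to libdeflate's levels 1-12 through its own API; the helper name is mine):

```python
import zlib

def best_zlib_level(data):
    """Compress at every level 1-9 and return (best_level, best_size)."""
    sizes = {lvl: len(zlib.compress(data, lvl)) for lvl in range(1, 10)}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]
```

For PNG specifically you would re-run such a search per image, since, as noted above, the strongest setting is not guaranteed to win on every input.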
I noticed that in the 1.9 version the compression levels "overlap"; I mean, some of them are basically the same. I took the silesia corpus* and here is the result:
Compression at levels 8 and 9, and at 11 and 12, is almost the same - a difference in ratio of 0.02% and 0.01% is hardly noticeable. Level 10 is not much different from 11 either. The difference between 10 and 12 is ~0.05%, while between 9 and 10 it is almost 1%.
First I decided to leave levels 6, 9, and 12 as they are and spread those in between by ratio, also because level 9 is now the dividing line between the `lazy`/`lazy2` and `near_optimal` algorithms. At first I thought an even spread would be good, but then I realised that a "logarithmic" spread, or something like that, would be better, as it would resemble the existing levels. So I calculated the new ratios to sit at 1/2, 3/4 (and 1) of the gap between 6 and 9 (31.85 - 31.48), as well as between 9 and 12 (31.48 - 30.52). That would be the "ideal" to aim for.
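The arithmetic described here is straightforward; a sketch of the target-ratio computation using the silesia ratios quoted above (the function name is mine):

```python
def spread_targets(lo_ratio, hi_ratio):
    """Target ratios at 1/2 and 3/4 of the way from lo_ratio down to hi_ratio."""
    gap = lo_ratio - hi_ratio
    return lo_ratio - gap / 2, lo_ratio - 3 * gap / 4

lv7, lv8 = spread_targets(31.85, 31.48)    # between levels 6 and 9
lv10, lv11 = spread_targets(31.48, 30.52)  # between levels 9 and 12
print(round(lv7, 4), round(lv8, 4))    # 31.665 31.5725
print(round(lv10, 2), round(lv11, 2))  # 31.0 30.76
```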
First I took the 9-12 level range and checked what the results were for v1.8. After some tweaking I got to the point where they were almost perfectly matched. Then with levels 6-9 it wasn't that easy, but I brought it to an acceptable point. Now the results are like this:
Deltas calculated for it are as follows:
The results are spread a bit more "evenly" and the gap between 9 and 10 is halved.
Similar results are for other data sets.
I think you should consider adjusting the compression levels to this, or to something similar.
Here are the changes to `deflate_compress.c` as a diff. I will do a pull request if that is what you are interested in.

\* I chose it as it is a diverse, non-homogeneous, relatively big corpus that resembles real-life data, IMO best for a general-purpose compressor. I tested other corpora available on the net and the results were very similar, almost the same. They include enwik, the lukas medical images, and my "own", namely app (mozilla - 64-bit executables from the silesia corpus, Google Earth 32-bit for Windows, and Firefox for Linux), png-dec (a bunch of decompressed PNG images), and html/css/js (a bunch of site styles and scripts; something that imitates HTML pages).
\*\* To produce the results I used lzbench with libdeflate-1.9 support.