facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Surprising behavior on some data set #1979

Open arl opened 4 years ago

arl commented 4 years ago

I ran some zstd benchmarks on various datasets. At my company we're using zstd for many applications, which don't all have the same requirements in terms of CPU, memory, speed and ratio.

Each plot shows the results of benchmarks run with `zstd -q -b3 -e14`, `zstd -q -b3 -e14 --long=27` and `zstd -q -b3 -e14 --long=29`.
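For reference, each plot is just the built-in benchmark mode run three ways on a given file (the file name below is a placeholder; --long=27 and --long=29 correspond to 128 MB and 512 MB windows):

```
# levels 3..14 with the default match finder
zstd -q -b3 -e14 dataset.log
# same levels with long-distance matching enabled (128 MB and 512 MB windows)
zstd -q -b3 -e14 --long=27 dataset.log
zstd -q -b3 -e14 --long=29 dataset.log
```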

Warning: in these pictures, ratio is original/compressed, so bigger numbers mean more compression, as reported by zstd's benchmark mode.

16 of the 18 datasets show the expected plots, while 2 of them show what is, at least to me, surprising behavior:

This is how zstd behaves for most of the datasets: [benchmark plot]

But here are the 2 data sets with unexpected results: [benchmark plot]

[benchmark plot]

NOTE: all benchmarks were run on zstd v1.4.4 on linux/x86-64 architecture

Since then, I have re-run them multiple times under the original conditions and obtained similar results. I also reproduced this on another machine (same architecture) with version v1.4.3.

Here it is: [benchmark plot]

Is this a bug, or is it somewhat expected?


`api` raw data

```
title=api lines
plot=--long=no
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -3 26960545 (46.895) 1166.39 MB/s 3773.7 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -4 27029704 (46.775) 1181.65 MB/s 3779.3 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -5 25575654 (49.434) 266.62 MB/s 3863.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -6 24745208 (51.094) 257.14 MB/s 4023.8 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -7 23235681 (54.413) 207.01 MB/s 4490.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -8 22522655 (56.135) 183.19 MB/s 4691.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -9 22244120 (56.838) 144.06 MB/s 4572.9 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -10 22394956 (56.456) 120.05 MB/s 4506.5 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -11 22326886 (56.628) 106.40 MB/s 4421.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -12 22102242 (57.203) 87.16 MB/s 4421.3 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -13 21859954 (57.837) 63.23 MB/s 4389.3 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -14 21721543 (58.206) 52.89 MB/s 4411.4 MB/s api.log.multi
plot=--long=27
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -3 26120036 (48.404) 116.30 MB/s 3520.0 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -4 27565578 (45.866) 113.13 MB/s 3095.5 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -5 27691468 (45.657) 102.20 MB/s 2928.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -6 27258039 (46.383) 101.79 MB/s 3005.4 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -7 26411661 (47.870) 98.29 MB/s 3142.9 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -8 26115750 (48.412) 95.04 MB/s 3223.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -9 26134271 (48.378) 87.40 MB/s 3071.9 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -10 26186735 (48.281) 80.54 MB/s 3076.4 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -11 26195783 (48.264) 73.10 MB/s 2987.6 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -12 26036124 (48.560) 70.31 MB/s 3050.8 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -13 26150741 (48.347) 55.77 MB/s 2977.8 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -14 26123748 (48.397) 50.61 MB/s 3023.1 MB/s api.log.multi
plot=--long=29
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -3 26690138 (47.370) 118.34 MB/s 3489.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -4 28143404 (44.924) 114.54 MB/s 3060.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -5 28127385 (44.950) 105.30 MB/s 2859.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -6 27667942 (45.696) 104.77 MB/s 2927.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -7 26864003 (47.064) 101.09 MB/s 3097.9 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -8 26603721 (47.524) 98.19 MB/s 3115.4 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -9 26609787 (47.513) 89.50 MB/s 2997.4 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -10 26701938 (47.349) 82.92 MB/s 2941.6 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -11 26708182 (47.338) 76.06 MB/s 2835.2 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -12 26520902 (47.673) 72.57 MB/s 2913.3 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -13 26614552 (47.505) 55.92 MB/s 2851.1 MB/s api.log.multi
bench 1.4.4 : input 1264319464 bytes, 3 seconds, 0 KB blocks -14 26621317 (47.493) 50.50 MB/s 2889.3 MB/s api.log.multi
```
`dat` raw data

```
title=dat lines
plot=--long=no
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -3 56281414 (50.477) 1198.28 MB/s 3818.1 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -4 56612769 (50.182) 1195.63 MB/s 3771.4 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -5 54181245 (52.434) 262.53 MB/s 4150.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -6 52702676 (53.905) 248.03 MB/s 4222.0 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -7 50863166 (55.855) 213.50 MB/s 4476.6 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -8 49629523 (57.243) 196.81 MB/s 4601.0 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -9 48835902 (58.173) 155.05 MB/s 4599.8 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -10 48248656 (58.881) 129.35 MB/s 4417.5 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -11 48080924 (59.087) 111.17 MB/s 4377.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -12 47315330 (60.043) 94.42 MB/s 4405.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -13 46865871 (60.619) 61.70 MB/s 4434.8 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -14 46555855 (61.022) 53.15 MB/s 4416.4 MB/s dat.log.multi
plot=--long=27
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -3 58345691 (48.692) 162.99 MB/s 3602.6 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -4 62274245 (45.620) 156.69 MB/s 3193.0 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -5 62374461 (45.547) 139.49 MB/s 3106.9 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -6 61264457 (46.372) 135.01 MB/s 3135.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -7 60095917 (47.273) 129.01 MB/s 3268.2 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -8 59649707 (47.627) 122.90 MB/s 3262.3 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -9 59863260 (47.457) 111.73 MB/s 3199.6 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -10 59838214 (47.477) 101.03 MB/s 3153.4 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -11 59826938 (47.486) 92.46 MB/s 3113.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -12 59530570 (47.722) 88.07 MB/s 3099.6 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -13 59664469 (47.615) 73.21 MB/s 3102.1 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -14 59547275 (47.709) 66.71 MB/s 3092.8 MB/s dat.log.multi
plot=--long=29
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -3 58398102 (48.648) 155.60 MB/s 3608.6 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -4 62527809 (45.435) 149.06 MB/s 3116.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -5 62853791 (45.199) 133.58 MB/s 2992.4 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -6 61719684 (46.030) 131.33 MB/s 3018.1 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -7 60507410 (46.952) 124.13 MB/s 3159.5 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -8 60036133 (47.321) 117.89 MB/s 3182.8 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -9 60395106 (47.039) 106.37 MB/s 3050.7 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -10 60452469 (46.995) 97.63 MB/s 2967.8 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -11 60433215 (47.010) 89.39 MB/s 2925.0 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -12 60143641 (47.236) 83.47 MB/s 2925.1 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -13 60209403 (47.184) 70.54 MB/s 2938.4 MB/s dat.log.multi
Not enough memory; testing 2709 MB only...
bench 1.4.4 : input 2840941909 bytes, 3 seconds, 0 KB blocks -14 60045117 (47.313) 63.81 MB/s 2920.5 MB/s dat.log.multi
```
Cyan4973 commented 4 years ago

Thanks for this great feedback @arl ,

If I understand the graphs for the "surprising" use cases properly, I note that the achieved ratios are extremely high, in the x50 range. This means the source data is extremely redundant. This is unusual territory, and indeed, some of the selected heuristics may not work well in such a scenario.

Regarding speed: I believe this part is correct and expected. The --long mode introduces an additional scan, which is neither slow nor extremely fast. On your system, it translates into a capped speed of ~150 MB/s. On extremely redundant data, the "regular" match finder might actually process data very quickly, because it has such an easy time finding long matches and skipping over entire sections of source input. So the graph may make it look like the --long mode slows to a crawl, but it's actually the one that preserves a constant speed of ~150 MB/s. In contrast, the regular match finder enjoys an impressive speed boost at all levels, and especially at lower ones.

So speed looks fine. What about compression ratio ?

Yes, this part is less good. --long should, ideally, not negatively impact compression ratio. "Ideally" is the operative word, though: we use simple heuristics to make decisions, and they usually work, but indeed, not always.

A simple starting point is that the --long mode is only interested in long matches (like >= 64 bytes), and when such a match is found, it achieves a great compression ratio. So a simple design is to just trust that this is a good choice and apply it verbatim, on the grounds that trying to optimize it further is not terribly beneficial.

Well, that's generally true, because most matches are short. But in these cases, since the compression ratio is so good, finding long matches is not difficult. So the regular match finder has access to matches that are as good as the ones the --long mode finds, and its extra work on exact match selection pays off, offering a bit more compression.

Note that the graph may make it look like one is compressing greatly and the other poorly, but that's not the right picture: --long stretches from x46 to x48, while "regular" stretches from x48 to x58. That's a 20% difference at most. Still, it's sizable.

So maybe we could try to optimize that angle, finding a way for match selection to be better optimized and better coordinated between the "regular" and the --long match finders. This will likely increase code complexity, but maybe there is a tractable territory where the benefits are worth this complexity.
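In the meantime, the LDM-specific knobs can already be experimented with from the command line through the advanced --zstd= syntax (option names below are the ones listed in the zstd manual; verify against your version, and note that ldmMinMatch defaults to 64, matching the threshold mentioned above). A rough sketch:

```
# benchmark with LDM enabled but a lower minimum LDM match length,
# to see whether accepting shorter long-distance matches changes the picture
zstd -q -b3 -e14 --long=27 --zstd=ldmMinMatch=16 api.log.multi
```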

We will likely need a sample where this impact is visible, so that we have a clear target to optimize and measure against.

edit: quick verification question: I've assumed that all data sets are rather "large", meaning > 1 GB, justifying the use of --long mode. Is that also the case for the 2 data sets showing this "strange" behavior?

arl commented 4 years ago

> If I understand the graphs for the "surprising" use cases properly, I note that the achieved ratios are extremely high, in the x50 range. This means the source data is extremely redundant. This is unusual territory, and indeed, some of the selected heuristics may not work well in such a scenario.

Indeed, the data is very redundant. To quickly describe the files: they contain CSV-like log lines, using 0x1e as the field delimiter and \n as the line delimiter. There's no header row to say what a column represents; instead, the column indices are fixed. It happens that, in the files with the surprising behaviour, many fields are empty, leading to long runs of 0x1e. Also, the empty fields are the same for all lines, so every line contains the same runs of consecutive 0x1e. To add to the repetition, there is also a non-negligible amount of duplication in the fields that do contain data.
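For illustration only (this is a made-up generator, not the real logs), data with the same shape can be synthesized: fixed columns, 0x1e delimiters, the same long run of empty fields in every line, and duplicated values in the populated ones:

```
# build a ~1 GB, highly redundant log file with 0x1e-delimited columns
RS=$(printf '\036')   # the 0x1e field delimiter
LINE="2020-01-01T00:00:00${RS}GET /api/v1/things${RS}${RS}${RS}${RS}${RS}${RS}200"
yes "$LINE" | head -n 20000000 > synthetic.log
```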

> Regarding speed: I believe this part is correct and expected. The --long mode introduces an additional scan, which is neither slow nor extremely fast. On your system, it translates into a capped speed of ~150 MB/s. On extremely redundant data, the "regular" match finder might actually process data very quickly, because it has such an easy time finding long matches and skipping over entire sections of source input. So the graph may make it look like the --long mode slows to a crawl, but it's actually the one that preserves a constant speed of ~150 MB/s. In contrast, the regular match finder enjoys an impressive speed boost at all levels, and especially at lower ones.

Makes total sense. Thanks for that explanation.

> So speed looks fine. What about compression ratio ?

> Yes, this part is less good. --long should, ideally, not negatively impact compression ratio.

Indeed, that is the part I found surprising.

> "Ideally" is the operative word, though: we use simple heuristics to make decisions, and they usually work, but indeed, not always.

> A simple starting point is that the --long mode is only interested in long matches (like >= 64 bytes), and when such a match is found, it achieves a great compression ratio. So a simple design is to just trust that this is a good choice and apply it verbatim, on the grounds that trying to optimize it further is not terribly beneficial.

> Well, that's generally true, because most matches are short. But in these cases, since the compression ratio is so good, finding long matches is not difficult. So the regular match finder has access to matches that are as good as the ones the --long mode finds, and its extra work on exact match selection pays off, offering a bit more compression.

> Note that the graph may make it look like one is compressing greatly and the other poorly, but that's not the right picture: --long stretches from x46 to x48, while "regular" stretches from x48 to x58. That's a 20% difference at most. Still, it's sizable.

> So maybe we could try to optimize that angle, finding a way for match selection to be better optimized and better coordinated between the "regular" and the --long match finders. This will likely increase code complexity, but maybe there is a tractable territory where the benefits are worth this complexity.

It's true that those ratios are already so high with the regular match finder that we could consider not using --long in those cases. The object of my research, though, was to find a good set of default options for all data sets, in case such a default exists.
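For what it's worth, that kind of search is easy to script on top of the benchmark mode; a sketch (dataset names are placeholders):

```
# sweep levels 3..14, with and without LDM, across every dataset
for f in api.log.multi dat.log.multi; do
  for mode in "" "--long=27" "--long=29"; do
    echo "file=$f mode=${mode:-none}"
    zstd -q -b3 -e14 $mode "$f"
  done
done > sweep-results.txt 2>&1
```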

> We will likely need a sample where this impact is visible, so that we have a clear target to optimize and measure against.

Great. Is the description of the data set I gave you enough? If not, I'd be happy to send you an extract, privately, since it contains sensitive data. I'd also have to redact the sensitive fields.

> edit: quick verification question: I've assumed that all data sets are rather "large", meaning > 1 GB, justifying the use of --long mode. Is that also the case for the 2 data sets showing this "strange" behavior?

Your assumption is correct: all data sets are > 1GB, in fact 1.2GB for api and 7.8GB for dat.

bimbashrestha commented 4 years ago

> The object of my research, though, was to find a good set of default options for all data sets, in case such a default exists.

Indirectly related: we now have a new CLI flag that shows you which compression parameters will be used for compression. You can then adjust the parameters using --zstd=<parameterName>=<value>. You might consider playing around with these options for your research @arl

eg:

# initial defaults
./zstd --show-default-cparams -f -15 CHANGELOG
CHANGELOG (28131 bytes)
 - windowLog    : 15
 - chainLog     : 16
 - searchLog    : 6
 - minMatch     : 3
 - targetLength : 256
 - strategy     : ZSTD_btopt
CHANGELOG            : 35.91%   ( 28131 =>  10101 bytes, CHANGELOG.zst)

# override minMatch param 
./zstd --show-default-cparams -f -15 --zstd=minMatch=4 CHANGELOG
CHANGELOG (28131 bytes)
 - windowLog    : 15
 - chainLog     : 16
 - searchLog    : 6
 - minMatch     : 3
 - targetLength : 256
 - strategy     : ZSTD_btopt
CHANGELOG            : 35.83%   ( 28131 =>  10078 bytes, CHANGELOG.zst)

Obviously the benefit from the manual override in this quick example is negligible, but you get the idea.
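A possibly useful addition (syntax as documented for --zstd=; worth double-checking on your build): several parameters can be overridden at once as a comma-separated list, e.g.:

```
# combined override of two compression parameters in one flag
./zstd --show-default-cparams -f -15 --zstd=minMatch=4,targetLength=4096 CHANGELOG
```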

rasky commented 3 years ago

This might be fixed in zstd 1.4.7 @arl

arl commented 3 years ago

Nice! Thank you all. I'll give it a go when time permits and update this issue with the results.

Cyan4973 commented 3 years ago

v1.4.9 has doubled the performance of the --long mode, so I presume it would help for the above scenario.

Still, the --long mode isn't multi-threaded, and remains the bottleneck (when compared to multi-threaded compression, or even fast modes).
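To make that concrete (an illustrative command only, reusing flags already shown in this thread): even with worker threads enabled, the --long (LDM) pre-scan runs on a single thread, so it tends to set the throughput ceiling on large inputs:

```
# -T0 auto-detects worker threads for the regular compression stages,
# but the long-distance-matching scan itself remains single-threaded
zstd -T0 --long=27 -5 dat.log.multi -o dat.log.multi.zst
```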

arl commented 3 years ago

Sorry it took me so long (no pun intended) :)

I've run the benchmarks again with 2 of the datasets used in the original post: prd is the one with the expected behaviour and api is the one that was surprising. I'm not able to use the exact same set of files, but the data is very similar.

There are 3 plots per dataset, summarizing benchmark results for v1.4.4, v1.4.8 and v1.5.0. The 3 plots for the same dataset share the same axes. All 3 versions were built with the same compiler and options.

prd dataset

Here I can see an exceptional speed improvement in long distance matching: nearly twice as fast in v1.5.0!

[plots: prd results for v1.4.4, v1.4.8 and v1.5.0]

api dataset

This is the one with highly repetitive data, hence the extreme compression ratio. Here too, the --long option in v1.5.0 provides an incredible speed improvement compared to previous versions (up to x2 indeed); however, the ratio remains higher without it, which is still surprising.

[plots: api results for v1.4.4, v1.4.8 and v1.5.0]

But don't get me wrong: I know we're talking about a rare case. I'm only providing this information in the hope it might be useful. zstd is already doing an excellent job, and keeps getting better!

Cyan4973 commented 3 years ago

Thanks for the insightful feedback @arl .

Not sure if it's practical for you to test, but I believe we fixed the compression ratio issue for levels 16+.

It would be great to know if it works, as it could give a clue to improve faster levels.

arl commented 3 years ago

There has indeed been a massive improvement in the long distance matching compression ratio at high levels on my highly repetitive dataset, particularly visible in v1.4.8. Note that the Y axis has a log scale; it helps with the number of points.

For the api dataset, going without long distance matching at levels 14/15 is still a better choice when optimizing for compression ratio.

prd dataset

[plots: prd results for v1.4.4, v1.4.8 and v1.5.0]

api dataset

[plots: api results for v1.4.4, v1.4.8 and v1.5.0]

I went up to level 19. I only learned afterwards about the existence of the ultra levels. I can update the plots with those if you think it could be interesting @Cyan4973

Cyan4973 commented 3 years ago

Thanks for the detailed feedback @arl . Level 19 is good enough to give a complete picture.

The compression ratio improvement for --long + Level >16 seems to work as intended.

Now, it's disappointing to see that compression efficiency doesn't follow a smooth progression beyond 15, especially on the api dataset. And even on prd, the reduction in ratio at level 18 is unexpected.

Now, this is likely a consequence of dealing with files with extremely high compression efficiency. > x50 is uncommon. Even x30 for prd is already in "very high compressibility" territory. There are likely some heuristics in the code that do not work well in these cases, since we are more used to testing samples featuring x3 to x8 compression efficiency.

It's difficult to go further in the analysis without access to samples, but even without them, it's good food for thought!

kevLmurphy commented 10 months ago

I stumbled across this issue in my work, and in my case it's the file type that causes it rather than the specific data, so I can share an example.

Printer Command Language files are used by professional-level HP printers. I'll be referring to PCL5 (version 5) as PCL throughout.

They are extremely repetitive and perfect candidates for zstd compression. They're a mixture of about 35 commands/symbols and unicode text only. It's not at all uncommon for a 500 MB PCL file to compress down to 2.5 MB.

I started with this file. It contains every PCL5 command. https://github.com/michaelknigge/pcldumper/blob/master/pcldumper/src/test/resources/generic_test.pcl

I then concatenated it with itself about 150,000x, as it is only about 18 lines long and only one of those lines is in PCL text format. This is still on the smaller side for a PCL file; these files are not usually used unless you are printing tens of thousands of pages per day. Due to their size, dictionaries are also not really useful for compression.
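For anyone reproducing this, the blow-up step is plain concatenation; a rough sketch (the doubling count is approximate, chosen to land near the ~150,000x figure above):

```
# start from generic_test.pcl and double it repeatedly: 2^18 ≈ 262k copies
cp generic_test.pcl big.pcl
for i in $(seq 1 18); do
  cat big.pcl big.pcl > tmp.pcl && mv tmp.pcl big.pcl
done
```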

Here's the sample compressed with zstd --ultra -22 and then zipped.

out22.zip

I have observed similar results with pretty much any PCL file that is large enough (around 100 MB). They generally compress to below 1% of their original size, even with valid data.

Tested on Windows 11, Ryzen 5600X, zstd 1.5.5. I've observed similar results on RHEL 7.9 with an Intel Xeon Gold 6140 as well.

Testing commands: `zstd -T0 --ultra --long -$level $pclfile` and `zstd -T0 --ultra -$level $pclfile`

Results: the input is 144 MB; all other numbers are in bytes. Key: compression level - (non-long size vs --long size).

- 22 - (13,259 vs 13,259 bytes): same result, regardless of --long.
- 18, 19, 20, 21 - (13,340 vs 13,329 bytes): all have exactly the same output sizes; --long=27 is slightly smaller than non-long, as expected.
- 17 - (13,333 vs 13,320): better compression than levels 18-21; --long=27 is slightly smaller than non-long, as expected.
- 16 - (13,424 vs 13,365): --long=27 is slightly smaller than non-long, as expected.
- 15 - (14,450 vs 16,505): --long=27 is unexpectedly worse. However, this is very weird: when run without long, it takes 1168.0968 ms to compress; when run with long, it takes 185.563 ms. It's consistently much slower than other levels. (On zstd --long --ultra -22, it takes 304.5 ms, so it's taking close to 4x as long to run.)
- 14 - (14,450 vs 16,505): --long=27 is unexpectedly worse. Much faster than 15, but the same result.
- 9 - (14,450 vs 14,450): --long=27 does not matter. Much faster than levels 12-15, but an equivalent result.
- 3 - (14,487 vs 16,510): --long=27 is unexpectedly larger.

Fast-1 also has this issue; Fast-5 does not.

To be clear, this isn't exactly a large problem for PCL files, as they already compress so well. They're just an easy way to see the effect referenced in this issue, though some higher levels (i.e. 17-19) don't have the issue from the original post. (Could be due to a newer zstd version? Data being more compressible? Unclear.)