facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Significantly worse compression than ZLIB on DWARF debug info #2832

Open resistor opened 2 years ago

resistor commented 2 years ago

Describe the bug
ELF executables are able to store DWARF debug info in a compressed format, using ZLIB. I have been experimenting with using ZSTD as a replacement for it, and am generally seeing good results. However, on the ".debug_str_offsets" section, I am seeing significantly worse compression ratios with ZSTD than with ZLIB, even at max compression levels (176KB vs 138KB, from 290KB uncompressed).

To Reproduce
I have attached uncompressed.bin.gz, which is an extracted .debug_str_offsets section from a debug build of the LLVM compiler. It has been compressed with gzip -9. You can reproduce my results above by extracting it to obtain uncompressed.bin, and recompressing that with zstd -19.

Expected behavior
I would generally expect ZSTD's compression ratio to be no worse than that of ZLIB.

Desktop (please complete the following information):
zstd command line interface 64-bits v1.5.0
gzip 1.11
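
A minimal sketch of those reproduction steps, assuming the attachment was saved as uncompressed.bin.gz and both tools are on PATH (the output file names are just placeholders):

gzip -dk uncompressed.bin.gz                 # recover uncompressed.bin, keep the attachment
gzip -9 -c uncompressed.bin > zlib9.gz       # zlib/gzip at maximum level
zstd -19 -c uncompressed.bin > zstd19.zst    # zstd at maximum standard level
wc -c uncompressed.bin zlib9.gz zstd19.zst   # compare the three sizes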

Cyan4973 commented 2 years ago

I'm getting good compression ratio on this file at level 18 specifically (<130KB). Then it gets worse at levels 19 and beyond.

This is uncommon, though it does unfortunately happen from time to time. The explanation is something like a "trapped in a local minimum" scenario, which prevents the algorithm from "finding" the globally shortest path, due to minuscule evaluation differences at the beginning of the file. This is a known problem for all these algorithms, and finding a way to not miss the "better" opportunity is non-trivial, often involving compromises.

Anyway, your file is proving to be a great training ground for analyzing this situation. That could prove useful in finding a solution.

resistor commented 2 years ago

Thanks for the analysis. Unfortunately, for this example ZLIB also outperforms ZSTD at the default compression level (138KB vs 234KB).

It also appears that there's something specific about the data patterns in these .debug_str_offsets sections that is causing ZSTD pain. I consistently see them compressing significantly better with ZLIB than with ZSTD across a range of object files. I can source more examples if that would be helpful.

Cyan4973 commented 2 years ago

The files seem to generate a ton of short matches of length 3, following a regular 4-byte pattern. That's not too surprising: every time we've seen ratio problems at high settings, it has involved this kind of pattern. Length-3 matches are "relatively difficult" for zstd. They can be found, but it's then harder to determine whether they are worthwhile. Worse, selecting one match over another can influence the selection for the rest of the block, because the evaluations are extremely close, so it's easy to lean one way or the other and then continuously reinforce that choice.

For files > 256 KB, only levels 18+ can hope to handle that properly. Even then it's difficult: the result is generally okay, but there are cases where the algorithm leans the wrong way.

I presume we have enough data to learn from the provided sample already. Once it gets improved, it will be interesting to see if the tentative update survives more examples.

resistor commented 2 years ago

I would expect most of the data in this file to be LEB128 variable-width integers. I haven't checked, but it's plausible that they're sorted in some way that would cause the 3-byte ones to come first.

dwblaikie commented 2 years ago

ah, no, they aren't LEB128 encoded in .debug_str_offsets - they're regular sized for faster lookup (by index): https://dwarfstd.org/doc/DWARF5.pdf bottom of page 240: "This header is followed by a series of string table offsets that have the same representation as DW_FORM_strp. For the 32-bit DWARF format, each offset is 4 bytes long; for the 64-bit DWARF format, each offset is 8 bytes long."

resistor commented 2 years ago

Thanks for that clarification.

Looking at the specification and the raw bytes, the data consists of a 12-byte header followed by a sequence of 4-byte little-endian unsigned integers, sorted in increasing order. In this particular file, the high byte of the values is always zero. I'm not sure where specifically 3-byte matches would be coming from, though there's plenty of redundancy in general.
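
For anyone who wants to eyeball that layout, a quick sketch (assuming GNU od and the extracted uncompressed.bin from the first comment):

# skip the 12-byte header, then print 4-byte unsigned integers;
# -t u4 uses host byte order, so on a little-endian machine the values
# should come out sorted and (for this sample) below 2^24
od -A d -j 12 -t u4 uncompressed.bin | head -20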

resistor commented 2 years ago

I am attaching two more examples of other debug info sections that compress worse with ZSTD than with ZLIB at default compression levels.

debug_loclists.bin.gz - ZSTD 6.4MB vs ZLIB 5.9MB
debug_rnglists.bin.gz - ZSTD 6.7MB vs ZLIB 5.2MB

Both of these examples are prefixes of the actual section (due to GitHub attachment limits). One interesting observation: The smaller the prefix I use, the better ZSTD performs relative to ZLIB, i.e. the ratio difference grows super-linearly with input size.
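
A rough sketch of that prefix experiment, assuming bash and GNU head, with section.bin standing in for one of the extracted sections:

for size in 256K 1M 4M 16M; do
  head -c "$size" section.bin > prefix.bin
  echo "$size: gzip=$(gzip -9 -c prefix.bin | wc -c) zstd=$(zstd -19 -c prefix.bin | wc -c)"
done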

DemiMarie commented 2 years ago

> This is uncommon, though it does unfortunately happen from time to time. The explanation is something like a "trapped in a local minimum" scenario, which prevents the algorithm from "finding" the globally shortest path, due to minuscule evaluation differences at the beginning of the file. This is a known problem for all these algorithms, and finding a way to not miss the "better" opportunity is non-trivial, often involving compromises.

For applications in which compression speed is basically irrelevant, would it be possible to try expensive options, such as trying many different paths and picking the best one? I would not at all be surprised if optimum compression is NP-hard or worse, but I imagine there are options if one has compute power to spare.

Cyan4973 commented 2 years ago

> For applications in which compression speed is basically irrelevant, would it be possible to try expensive options, such as trying many different paths and picking the best one? I would not at all be surprised if optimum compression is NP-hard or worse, but I imagine there are options if one has compute power to spare.

Yes. But the computing requirement increases very steeply as the number of analyzed options increases.

DemiMarie commented 2 years ago

Would using a dictionary help? What about preprocessing?

Cyan4973 commented 2 years ago

All these options are possible. They cost a lot of investigation time though.

ishitatsuyuki commented 1 year ago

FYI, I just found a benchmark on debug info done by MaskRay: https://maskray.me/blog/2022-09-09-zstd-compressed-debug-sections

Looking at the numbers, the gist seems to be that using a level around -9 yields speed comparable to gzip while almost always outperforming its ratio.
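
For anyone wanting to run that kind of speed/ratio comparison on their own data, zstd's built-in benchmark mode can sweep a level range in one pass (a sketch; FILE is a placeholder):

zstd -b1 -e19 FILE              # benchmark levels 1 through 19, reporting ratio and speed
time gzip -6 -c FILE | wc -c    # rough gzip reference point for the same file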

dwblaikie commented 1 year ago

Except that analysis doesn't seem to have data on .debug_str_offsets in particular - which was an outlier in @resistor's analysis.

So this is still an interesting bug - though, yes, DWARF producers can adopt zstd for a win overall - it's just a bit awkward when one section regresses (we could technically keep compressing that one section with zlib for a win there, which is an unfortunate incentive).

nickaein commented 1 year ago

We're having a similar issue when compressing our parquet files: zstd leads to a worse compression ratio compared to gzip, and can only beat gzip at level 18 or higher, which is too slow for our use case.

Here is a sample of parquet sizes compressed with zstd at level=9:

346.5 M  /corp/test_data/zstd9/part=2022-10-04-00
221.9 M  /corp/test_data/zstd9/part=2022-10-04-01
156.4 M  /corp/test_data/zstd9/part=2022-10-04-02
128.8 M  /corp/test_data/zstd9/part=2022-10-04-03
127.9 M  /corp/test_data/zstd9/part=2022-10-04-04
189.8 M  /corp/test_data/zstd9/part=2022-10-04-05
504.5 M  /corp/test_data/zstd9/part=2022-10-04-06
1.2 G    /corp/test_data/zstd9/part=2022-10-04-07
1.4 G    /corp/test_data/zstd9/part=2022-10-04-08
1.5 G    /corp/test_data/zstd9/part=2022-10-04-09
1.8 G    /corp/test_data/zstd9/part=2022-10-04-10
1.8 G    /corp/test_data/zstd9/part=2022-10-04-11

and here is the size of the same data compressed with gzip:

308.1 M  /corp/test_data/gzip/part=2022-10-04-00
196.0 M  /corp/test_data/gzip/part=2022-10-04-01
137.5 M  /corp/test_data/gzip/part=2022-10-04-02
115.8 M  /corp/test_data/gzip/part=2022-10-04-03
112.2 M  /corp/test_data/gzip/part=2022-10-04-04
167.4 M  /corp/test_data/gzip/part=2022-10-04-05
451.8 M  /corp/test_data/gzip/part=2022-10-04-06
1.1 G    /corp/test_data/gzip/part=2022-10-04-07
1.2 G    /corp/test_data/gzip/part=2022-10-04-08
1.4 G    /corp/test_data/gzip/part=2022-10-04-09
1.6 G    /corp/test_data/gzip/part=2022-10-04-10
1.7 G    /corp/test_data/gzip/part=2022-10-04-11

The data cannot be shared publicly, but I can provide a sample parquet file if that helps. Just contact me at nickaein.i@gmail.com.

Cyan4973 commented 1 year ago

These files are rather large. I suspect the issue is that the compression level heuristic makes a bad bet regarding the minmatch value, which in this case is too high and results in too many lost opportunities.

Presuming you are using the zstd CLI, could you try something like:

zstd -9 --zstd=mml=4 FILE

and see how the compression ratio reacts to this parameter?
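
To get a broader picture than a single data point, the same parameter can be swept across several levels, e.g. (a sketch, assuming bash and a recent zstd CLI; FILE is a placeholder):

for level in 9 13 16 18 19; do
  for mml in 3 4 5; do
    size=$(zstd -"$level" --zstd=mml="$mml" -c FILE | wc -c)
    echo "level=$level mml=$mml -> $size bytes"
  done
done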

nickaein commented 1 year ago

I should clarify that the lists I posted are sizes of directories, and each directory usually contains three parquet files, so individual file sizes can be roughly estimated by dividing those numbers by 3. Sorry for the misleading stats.

I've selected one of the parquet files for testing; it is 62 MB uncompressed, and gzip gets it down to 33 MB.

With the zstd CLI, here are the output sizes for different configurations:

          mml=3   mml=4   mml=5
level=9   35 MB   35 MB   35 MB
level=13  34 MB   34 MB   35 MB
level=16  30 MB   34 MB   35 MB
level=18  30 MB   34 MB   34 MB
level=19  30 MB   34 MB   34 MB

Version info:

zstd v1.4.8
gzip 1.10

Cyan4973 commented 1 year ago

OK, so these parquet files seem very responsive to mml=3. Likely, these files consist of many 32-bit values.

Unfortunately, mml=3 is only properly supported with strategy btopt and higher. Without further customization, and presuming "large" files (> 256KB), it means levels 16+.

As a workaround, note that strategy is also a parameter that can be set manually, like mml. Try for example:

zstd -9 --zstd=mml=3,strat=7 FILE

Note that combining low levels with slow strategies can lead to unusual combinations that don't always work well; the exact outcome varies depending on the data.
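
Since the interaction between level, mml and strategy is data-dependent, a small sweep over strat can help check this on a given file (a sketch; strat values 7, 8 and 9 correspond to btopt, btultra and btultra2 in current zstd releases, and FILE is a placeholder):

for strat in 7 8 9; do
  size=$(zstd -9 --zstd=mml=3,strat="$strat" -c FILE | wc -c)
  echo "strat=$strat -> $size bytes"
done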

Within the tests directory there is a barely documented program called paramgrill, which can help find an optimal combination of parameters given a sample set and a set of constraints, such as compression speed. The interface is clunky, though, and the implementation is a bit buggy. It tends to work, but it's not 100% guaranteed.

Supporting mml=3 for faster strategies is currently not implemented. While it's not impossible to do, the problem is that, in the general case, it's a rather bad idea. I understand that some data is especially sensitive to this parameter and would benefit from such an on-demand capability. Unfortunately, the current code base is not ready for that, and it would take some effort to make it happen.

gcflymoto commented 1 year ago

Hi @Cyan4973, we're also seeing that with some datasets zstd requires fine-tuning of its parameters, which for the casual user can be very daunting. Has the zstd team explored the possibility of integrating the concept of paramgrill into zstd itself? An "auto-parameter-tuning" capability would really set zstd apart from other compression tools, and it could remove the roadblocks that come up when adopting zstd and hitting glass jaws compared to gzip.

It sounds like the heuristic search that paramgrill performs is more than adequate for reaching optimal combinations? Or is an ML-based reinforcement learning approach worth exploring? Or something like OpenTuner?

Documenting example usages of paramgrill would be greatly appreciated.

I guess in general what should our approach be to parameter tuning for different usage models?

Thank you for the outstanding technology that is zstandard

Cyan4973 commented 1 year ago

paramgrill unfortunately is not meant to be run "inline". It's essentially brute force, so it's not energy efficient and wouldn't fit within zstd itself.

paramgrill is more meant to be associated with an offline "training stage". This is similar to dictionary generation.

Finding better heuristics to determine parameters is a great topic. One difficulty, though, is achieving that goal across a variety of speed / efficiency trade-offs, in order to remain compatible with the speed targets of zstd's levels. ML could be used, assuming it's a lightweight variant (a small NN), though probably not for the fastest levels.

Cyan4973 commented 4 weeks ago

A few words on this topic: we've kept this issue open because it's an interesting use case for which we want zstd to do better.

Starting with v1.5.6, zstd at high compression levels features a better compression ratio on 32-bit structures. The .debug_str_offsets section is a pretty good example of this.

Specifically, on the sample provided in this issue, the compressed size at level 19 improves from 179,814 bytes to 131,410 bytes (roughly a 36.8% better ratio). Critically, it is now more compact than the reported gzip/zlib result at maximum level.
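
A quick way to check this on the original sample, assuming zstd >= 1.5.6 and the attachment extracted as uncompressed.bin:

zstd -V                               # confirm the CLI is v1.5.6 or newer
zstd -19 -c uncompressed.bin | wc -c  # expected to land around the figure quoted above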

This outcome seems to resolve this issue.