facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Size thresholds for changing the compression parameters penalize 256K inputs #3729

Open lucianpls opened 1 year ago

lucianpls commented 1 year ago

Description

The problem I am having is related to raster compression in GIS, web maps, and machine learning, where it is common to keep raster data in individually compressed tiles, usually 256x256 or 512x512 pixels. An RGBA byte tile at 256x256 pixels, or a single-byte 512x512 tile, is exactly 256K. Four-byte floating point or integer tiles at 256x256 are also 256K. When using the simple API, ZSTD compression parameters are chosen based on the level parameter and the input size. There are four sets of such parameters: for input sizes > 256K, > 128K, > 16K, and <= 16K. Based on those thresholds, the 256K size falls at the edge of the second tier.

Experimentally I found that if I make the input tiles slightly larger than 256K, thus switching to the first tier, compression at the middle levels is twice as fast while the result is smaller by a few percent. This is surprising but, given the description of the match finder introduced in version 1.5, not unexpected. The size limits where the parameter set changes seem arbitrary, and in this particular but common case the current values hurt compression performance considerably. Since power-of-2 buffer sizes are very common, they should be well served in terms of both compression and speed, which doesn't seem to be the case here. The performance difference was possibly not as large in version 1.5.0, but it is definitely noticeable now. It likely depends on the input data too, but in all the cases I've tested, using the first tier parameters is better than using the second tier.
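A minimal sketch to observe the tier switch directly (my own illustration, not the MRF code; it assumes the experimental ZSTD_getCParams() entry point, which requires defining ZSTD_STATIC_LINKING_ONLY):

/* Print the compression parameters the library picks for an input of
 * exactly 256K versus one byte more, at a given level. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

static void show(unsigned long long srcSize, int level)
{
    ZSTD_compressionParameters const cp = ZSTD_getCParams(level, srcSize, 0);
    printf("srcSize=%llu level=%d -> wlog=%u clog=%u hlog=%u slog=%u mml=%u tlen=%u strat=%d\n",
           srcSize, level, cp.windowLog, cp.chainLog, cp.hashLog,
           cp.searchLog, cp.minMatch, cp.targetLength, (int)cp.strategy);
}

int main(void)
{
    show(256 * 1024, 9);      /* a 512x512 single-byte band: exactly 256K */
    show(256 * 1024 + 1, 9);  /* one byte more: switches to the large-input tier */
    return 0;
}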

Proposed solution

The immediate solution is to move 256K inputs into the first tier. I found that this can be done by changing a single line in lib/compress/zstd_compress.c from
U32 const tableID = (rSize <= 256 KB) + (rSize <= 128 KB) + (rSize <= 16 KB);
to
U32 const tableID = (rSize < 256 KB) + (rSize <= 128 KB) + (rSize <= 16 KB);
As far as I can tell this has the desired effect and doesn't affect decoding with existing binaries. A similar change might be applicable to the 128K and 16K thresholds. It might also be useful to lower the 256K threshold to 192K, to include the even more common RGB 256x256 tile.

Alternatives

A slightly more flexible approach would be to allow the tier limits to be chosen at compile time. Since most users will not change the defaults, this should be done in addition to making the 256K input size fall in the first tier.
Using the streaming API seems to avoid this problem entirely, since it ends up using the first tier parameters. Using the advanced API can also handle this issue. Yet using the simple API should not result in such a large drop in performance.
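A sketch of that streaming path (my illustration, not the MRF code): because the source size is never pledged to the context, parameter selection behaves as it would for a large or unknown input.

/* Compress one tile through the streaming API in a single ZSTD_e_end call.
 * Error checks on ZSTD_CCtx_setParameter are omitted for brevity;
 * dstCapacity should be >= ZSTD_compressBound(srcSize). */
#include <zstd.h>

size_t compress_tile_streaming(void* dst, size_t dstCapacity,
                               const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return (size_t)-1;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);

    ZSTD_inBuffer  in  = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    size_t const remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(remaining) || remaining != 0) return (size_t)-1;
    return out.pos;  /* size of the compressed frame */
}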

Cyan4973 commented 1 year ago

It used to be that the settings for the 256KB tier were clearly better than the settings of the first tier for most data <= 256 KB. Since then, though, a few internals of the library have changed, so this tier of the table is probably due for a performance settings update.

Another problem with these "tiers" is that their settings are decided by training on some diverse sample sets. But we know that different files react differently to each individual setting. Therefore, the result of this training is some kind of "rough average" that tends to work fine on the diverse sample set, but it would be possible to focus on just one sample and show that the new setting is actually detrimental for that one case.

And that's another issue we may face here: we have no idea what data was used in your test, and it's difficult to claim generality from a single sample set focused on one category. The results are indeed better for RGBA tiles, but who's to say they are better for 256 KB inputs in general, at all levels, which is what the proposed change would enforce?

This is probably worth testing as part of the diverse sample set. Therefore, if there are public sample sets representative of your use case that could be used for compression level training, that would be a great contribution.

Finally, the user can always take charge of compression parameters directly. That's why the advanced interface exists (see --zstd=). All parameters from the first tier can be enforced on data of any size if the user wishes so. Granted, this is more complex than just invoking zstd with default settings; unfortunately, complexity is the cost to pay for finer-grained control.
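As an illustration of what this looks like through the library API (a sketch, using the first tier level 9 values as an example; the --zstd= CLI option exposes the same knobs):

/* Force a specific parameter row on the compression context, regardless of
 * input size. Error checks omitted for brevity; zstd may still shrink the
 * window-related fields to fit the actual input. */
#include <zstd.h>

size_t compress_with_tier1_level9(void* dst, size_t dstCapacity,
                                  const void* src, size_t srcSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog,    22);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_chainLog,     20);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_hashLog,      21);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_searchLog,     4);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_minMatch,      5);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_targetLength, 16);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_strategy,     ZSTD_lazy2);
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;  /* check with ZSTD_isError() */
}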

edit: I tested the tier 2 (<= 256 KB) parameters vs the tier 1 parameters on 256 KB inputs, and found nothing obvious. In general, the tier 2 settings tend to compress more but also more slowly than tier 1. I only found one counter-example (level 3), and that was by a tiny margin. On the other hand, tier 2 settings tend to scale better: as the compression level increases, the difference between these two tiers of settings tends to increase too. For example, tier 2 level 9 compresses a bit faster and a bit better than tier 1 level 12 on this sample set. Another way to say that the comparison is not obvious and requires multiple data points.

lucianpls commented 1 year ago

@Cyan4973, thanks for your quick answer.

I'm using the RGB 8-bit test images from http://imagecompression.info/test_images/

I'm using GDAL built with MRF and ZSTD. GDAL is a very common GIS library for reading and writing data. It has lots of dependencies, but the only ones strictly required for this test should be PROJ4 and sqlite3, plus the ZSTD library of course. MRF is a format that I wrote and still maintain; it uses ZSTD as one of the possible tile compressions. The compression code is at https://github.com/OSGeo/gdal/blob/master/frmts/mrf/mrf_band.cpp#L360C21-L403 and the filter applied in the single band case ends up just subtracting the previous pixel value, taking advantage of the locality present in images. The default ZSTD level is 9.

In a folder with the unpacked rgb8bit.zip test images, I timed this bash script, which converts each ppm in sequence to an MRF using ZSTD compression per tile. The ppm format is a tiny text header followed by raw data, so there is very little overhead.

for f in *.ppm
do
  gdal_translate -q -co COMPRESS=ZSTD -co INTERLEAVE=BAND -co BLOCKSIZE=512 $f dest512/${f%.*}.mrf
done

The BLOCKSIZE parameter is the linear tile size. Band interleave means every band is written as a separate 512x512x1 byte tile, thus exactly 256K. On a laptop with an SSD drive and plenty of RAM, the script ran in 15.828s. Then I changed the script slightly, to make each tile a bit larger than 256K. With 513x513 tiles, the size is about 257K.

for f in *.ppm
do
  gdal_translate -q -co COMPRESS=ZSTD -co INTERLEAVE=BAND -co BLOCKSIZE=513 $f dest513/${f%.*}.mrf
done

This script ran in 10.57s. There is some constant bash/gdal/MRF overhead, so the real ZSTD speed difference is larger than these numbers suggest. For the size difference, I simply compared the output folder sizes (both folders were empty to start with):

$ du -ks dest*
249880  dest512
244028  dest513

To really compare apples to apples, I swapped in a modified zstd library that uses first tier parameters for 256K inputs, so it generates exactly the same 512x512 tiles. Running the first script above, it took 10.750s and the output folder reported 243960 KB.

Looking at the individual images, only artificial.ppm is slightly smaller with the tier two parameters; all the others are smaller with tier one. I found this to be the case for all the other images I've tested. And the compression itself is twice as fast, which is very valuable even if the size were slightly larger. I agree that it might not be the case for all data, but it does look very consistent for natural images. I didn't select these images in any way; it's just the first image set I found in a quick internet search.

There are petabytes of natural imagery being used in GIS and many other fields, so this is likely to impact current and future users, and it's hard to notice. I only found out about it by chasing strange performance figures, even though ZSTD has been used in MRF for a couple of years now. I will likely switch to the streaming API in MRF, which side-steps this problem, but I wanted to raise awareness of this significant issue.

Cyan4973 commented 12 months ago

Some comments on this feedback: I'm not too surprised that the Tier2 settings (for blocks <= 256 KB) are slower than the Tier1 settings (for large inputs). What is more surprising is that the Tier2 compression ratio is not better than Tier1 when operating on 256KB blocks.

There are multiple reasons which could explain it, though I would need direct access to the data and a detailed analysis of the generated frames to determine that. For the time being, we can start by comparing parameters. Tier2:

{ 18, 18, 19,  5,  4,  8, ZSTD_lazy2   },  /* level  9 */

Tier1:

{ 22, 20, 21,  4,  5, 16, ZSTD_lazy2   },  /* level  9 */

But effectively, if one uses Tier1 parameters on 256KB data, zstd will auto-adjust the parameters to save resources, and you'll end up with:

{ 18, 18, 19,  4,  5, 16, ZSTD_lazy2   },  /* level  9 */
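This adjustment can be checked with the experimental ZSTD_adjustCParams() entry point (a sketch, requiring ZSTD_STATIC_LINKING_ONLY):

/* Feed the Tier1 level-9 row and a 256 KB source size to the library's own
 * adjustment logic and print the resulting row. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

int main(void)
{
    ZSTD_compressionParameters const tier1_l9 = { 22, 20, 21, 4, 5, 16, ZSTD_lazy2 };
    ZSTD_compressionParameters const adj = ZSTD_adjustCParams(tier1_l9, 256 * 1024, 0);
    printf("wlog=%u clog=%u hlog=%u slog=%u mml=%u tlen=%u strat=%d\n",
           adj.windowLog, adj.chainLog, adj.hashLog,
           adj.searchLog, adj.minMatch, adj.targetLength, (int)adj.strategy);
    /* expected to land close to { 18, 18, 19, 4, 5, 16, ZSTD_lazy2 } */
    return 0;
}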

So that's pretty close. The differences are: 1) Tier1 searches less (2^4 == 16 probes max), and 2) Tier1 focuses on matches of minimum length 5 (instead of 4 for Tier2).

The second difference is likely the more important one. If I understand correctly, the raw input data consists of 32-bit rgba pixels. Since Tier2 is looking for 4-byte matches, it's more likely to find matches than Tier1, because a single identical pixel is good enough for a match. Tier1 focuses on 5-byte matches, hence needs a little more than one pixel.

So, on first look, it seems that Tier2 is going to find more matches, therefore it should compress more.

But that's only the first effect. What can then happen is that:

And that's where we see that any compression ratio comparison between parameter sets is highly data dependent. I wouldn't draw general conclusions from results obtained on a single uniform data source. It may be that minmatch=5 is a better strategy for this particular data. That is highly unusual though; normally 4 wins most comparisons for "smaller" inputs. But it's certainly possible in specific cases, and this might be one of them.

If that is the case, one could use the advanced API to set the minmatch parameter to 5 and see if it improves the compression ratio accordingly. Speed can then be increased by reducing the number of searches (searchLog) if need be.
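A sketch of that tweak (error checks omitted for brevity; the searchLog value is only an example to be tuned):

/* Keep level 9 but override minMatch, and optionally lower searchLog for speed. */
#include <zstd.h>

size_t compress_tile_mml5(void* dst, size_t dstCapacity,
                          const void* src, size_t srcSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 9);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_minMatch, 5);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_searchLog, 4);  /* fewer probes, faster */
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;  /* check with ZSTD_isError() */
}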

That's why there is an advanced libzstd API, which allows precise fine-tuning of parameters for specific scenarios. And it's not a surprise that specific scenarios deserving this level of attention end up benefiting from adjusted custom parameters. The default is merely a "nice starting point", usually good enough, but playing with the parameters may offer better trade-offs.

lucianpls commented 12 months ago

The input is reorganized to single band since interleaved RGBA data always compresses worse. Matching sequences are not very common in natural images, not even at the 4-5 byte length. I think the effect is mostly due to the byte delta filter, which produces a byte distribution biased towards low values (positive and negative).
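For reference, a minimal sketch of the kind of per-band delta described here (not the actual MRF code linked above):

/* Replace each byte with its modular difference from the previous byte,
 * biasing the distribution toward values near 0 for smooth natural imagery. */
#include <stddef.h>
#include <stdint.h>

static void delta_encode_band(uint8_t* buf, size_t len)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t const cur = buf[i];
        buf[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}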