facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Poor compression of binary numeric data with surprising results #3014

Open RedBeard0531 opened 2 years ago

RedBeard0531 commented 2 years ago

Describe the bug

I wrote a simple test to produce a file with 1 million 4-byte incrementing ints and tried to compress it with various settings. I tried big and little endian, and I tried starting at both 0 and 0x12345678 to exercise all bytes. I'm sure there are many more interesting cases, but this seemed like a good starting point and already revealed a lot. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.

Interesting findings:

  1. Compression levels 1 through 17 behave roughly the same regardless of input, reducing the data to about 67% of its initial size.
  2. Something changes at level 18, and for most inputs it gives significantly better results.
  3. --format=xz -1 still compresses this data much better and faster than --format=zstd -18.
  4. Zstd is very sensitive to endianness and base value, while xz barely changes. (I wrote this before noticing that I wasn't passing -1 when testing xz as I intended to. When I added it and reran the tests, only one result really changed size, and it got smaller...)

Results:

| Endian | Base | zstd -18 | xz -1 | Notes |
|--------|------|----------|-------|-------|
| Big | 0 | 10.46% in 0.791s | 6.47% in 0.272s | Best result for zstd |
| Big | 0x12345678 | 26.84% in 0.578s | 6.38% in 0.263s | |
| Little | 0 | 67.21% in 0.532s | 2.52% in 0.243s | Worst result for zstd; -18 gives the same compression as -1. Best result for xz, although its default level compresses to 6.30% |
| Little | 0x12345678 | 25.43% in 0.568s | 6.37% in 0.257s | With a non-zero base, endianness doesn't matter |

To Reproduce

I used this Python file to generate the test data, toggling the commented-out lines.

```python
from array import array

BASE = 0
#BASE = 0x12345678 # uncomment to test non-zero high bytes

a = array('i')
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap() # uncomment to try big endian

with open('test.bin', 'wb') as f:
    a.tofile(f)
```

I ran `python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null` to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.
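For anyone who wants to sweep all four variants in one go, a rough driver along these lines works. This is only a sketch (not part of the original report); it assumes the zstd CLI is on the PATH and was built with lzma support so that --format=xz is available, and the file names are arbitrary:

```python
# Hypothetical benchmark driver: regenerate each endian/base variant and record
# the zstd -18 and xz -1 ratios and times. Assumes the `zstd` CLI is installed.
import os
import subprocess
import time
from array import array

def write_sample(path, base, big_endian):
    a = array('i', (base + i for i in range(1000000)))  # 1M incrementing 4-byte ints
    if big_endian:
        a.byteswap()
    with open(path, 'wb') as f:
        a.tofile(f)

def compress(args, path):
    start = time.perf_counter()
    # -q: quiet, -f: overwrite the previous output file if present
    subprocess.run(['zstd', '-q', '-f', *args, path, '-o', path + '.cmp'],
                   check=True)
    elapsed = time.perf_counter() - start
    ratio = os.path.getsize(path + '.cmp') / os.path.getsize(path)
    return ratio, elapsed

for big_endian in (False, True):
    for base in (0, 0x12345678):
        write_sample('test.bin', base, big_endian)
        for label, args in (('zstd -18', ['--format=zstd', '-18']),
                            ('xz -1', ['--format=xz', '-1'])):
            ratio, secs = compress(args, 'test.bin')
            endian = 'big' if big_endian else 'little'
            print(f'{endian:6} base={base:#x} {label}: {ratio:.2%} in {secs:.3f}s')
```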

Expected behavior

I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:

  1. The sensitivity to endianness is very surprising to me. It might indicate either a bug or low-hanging fruit for improvement.
  2. The stark change going from level 17 to level 18 for very regular data like this implies that some feature is enabled at 18 that isn't enabled earlier. Maybe it should be? And if not, it would be nice to know what it is, so that use cases where this pattern is likely can opt in to it separately without also paying for larger window sizes and the like.
  3. I was actually more surprised by how little xz was affected by the choice of base value at its default level, but the extent to which it affected zstd seemed excessive.
  4. It seemed interesting that the best case for xz -1 was also the worst case for zstd -18.


Cyan4973 commented 2 years ago
RedBeard0531 commented 2 years ago

Thanks for explaining the 3-byte cutoff for zstd. That led me to try expanding the data to 8-byte int64s (using array('q') and BASE = 0x1234567890abcdef in my script). I found that at the default level zstd makes the output significantly smaller than with int32s (1.00 MiB vs 2.56 MiB, with <1% difference based on base and endianness) even though the input is twice as large. That seems to track with finding 1 million matches vs finding none. For comparison, xz -1 is consistently around 250 KiB, while zstd -18 is 1 MiB for little endian and 80 KiB for big endian, regardless of base value.
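For reference, the 64-bit variant described above amounts to a small change to the generator script; here is a sketch of the modified version (assuming the same file name and element count as before, which were not restated in the thread):

```python
from array import array

BASE = 0x1234567890abcdef  # non-zero base so all 8 bytes are exercised

a = array('q')  # 'q' = signed 64-bit integers
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap()  # uncomment to try big endian

with open('test.bin', 'wb') as f:
    a.tofile(f)
```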

The endianness sensitivity (which is much higher for 8-byte data than for 4-byte data) is still very surprising to me. In either case, there is a repeating pattern of a 7-byte match and a 1-byte change. I can't imagine why it would matter what order they come in, since it is just a matter of which byte you start looking at. Unless there is something special about alignment, but I wouldn't expect textual data to be aligned, so I doubt it.
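To make that pattern concrete, here is a small illustrative snippet (added for clarity, not from the thread) that dumps a few consecutive 64-bit values in both byte orders: the big-endian dump shows seven unchanged bytes followed by the changing low byte, while the little-endian dump starts each value with the changing byte.

```python
import struct

BASE = 0x1234567890abcdef
for fmt, label in (('>q', 'big'), ('<q', 'little')):
    dump = b''.join(struct.pack(fmt, BASE + i) for i in range(4))
    # Group the hex output per 8-byte value to show which byte changes.
    # (bytes.hex() with a separator needs Python 3.8+.)
    print(f'{label:7}', dump.hex(' ', 8))
```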

> Consequently, as zstd leans more towards speed, its design does not employ these mechanisms.

Well, in this specific case zstd -18 is both slower and less effective at compression than xz -1. But based on what you said, it may just be a quirk of this specific input, so perhaps it isn't too important to optimize for.

> I presume your proposed objective here is to reach the best ratio (~10%) with zstd -18 on all variations of the sample.

My current objective with this investigation is actually somewhat detached from that. I work on a database, and we support zstd (among others) as a block compressor for our on-disk pages. For a few reasons, it is both difficult and potentially undesirable to do numeric-specific compression (e.g. delta encoding) on the format we use prior to feeding it to the block compressor. I was using this test to get a quick back-of-the-envelope estimate of how well zstd handles cases with a lot of similarish ints, to determine the cost-benefit tradeoff of working on numeric compression at the higher levels of our code. But it now looks like the simple test was way too simple to be useful. I'll need to try compressing actual pages (separately rather than in bulk) using our actual encoding (which isn't just an array of fixed-size ints). As usual, there are no shortcuts to benchmarking with real data šŸ˜
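As a rough illustration of the numeric-specific preprocessing mentioned above, a delta-encoding pass in front of the block compressor could look like the sketch below. This is purely hypothetical (not the database's actual page format); for an incrementing sequence it turns the payload into one leading value plus a long run of identical small deltas, which any LZ/entropy stage handles easily.

```python
from array import array

def delta_encode(values):
    """Keep the first value, then store successive differences."""
    out = array('q', [values[0]])
    prev = values[0]
    for cur in values[1:]:
        out.append(cur - prev)
        prev = cur
    return out

original = array('q', (0x1234567890abcdef + i for i in range(1000000)))
deltas = delta_encode(original)  # first element, then 999,999 ones
# deltas.tobytes() would be fed to the block compressor instead of original.tobytes()
```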

My broader objective in filing the ticket is to see whether this is indicative of some general case where zstd compression can be improved. For example, while I'm using binary ints in my test, I suspect it has properties in common with large amounts of similarish ASCII-decimal numbers (e.g. physical sensor readings over time) encoded in CSV or JSON, which seems like a case zstd would care about. However, if you think the observed results are just a quirk of this specific synthetic data set and wouldn't apply to real-world use cases that you are targeting, then feel free to close the ticket. I can create a new ticket if I find anything interesting or problematic when I try compressing with our actual page encoding.

PS- since I assume you've spent a lot of time researching many block compression implementations, do you happen to know of any performance-oriented ones that compress well with low-entropy, compact numeric data? Thanks!

Cyan4973 commented 2 years ago

inc1M_6k.zst.gz

This file manages to compress the first 1,000,000 integers in 32-bit little-endian format into 6242 bytes of zstd format (recompressed into .gz after that, due to a GitHub limitation). That is a compression ratio of 0.15%, or equivalently a compression multiplier of x640. There are a few tricks remaining that could compress it even more, but I don't see the point in pushing further. It shows that the format can handle this specific sample efficiently.

To achieve this result, I had to guide the search process and take control of block boundaries, so this is not something the "regular" compressor achieves. Flexible block boundaries are an area of interest where we expect to make some progress in the future, though this specific scenario benefits unusually much from it; real-world data won't see such large gains.

A real achievement would be to reach the same result automatically, using only the existing match finder and cost evaluation mechanisms. That, unfortunately, is unlikely. As a reference point, zstd -18 achieves ~25% on this same sample (note: for some reason, this is nonetheless noticeably better than the 67% mentioned in your measurements).

The compressed file had to be compressed again with gzip, due to a GitHub limitation. Incidentally, that's a good opportunity to note that the compressed file is itself highly compressible. This is because of the extremely regular nature of the data, which produces regularity in the compressed stream itself. Such an effect is not expected in the presence of "normal" data.

An array of unique 32-bit numbers is kind of the worst case for zstd. In this scenario, there are only 3-byte matches to be found. And while higher compression levels do feature a 3-byte match finder, it's pretty weak and not designed for thorough search.

It might be possible to improve it a bit, especially given that the compression results are so different depending on endianness and start value, which suggests some shortcoming. It's unclear at this stage whether that can be addressed without breaking other things. But even if it can, I wouldn't hold my breath: it will "just" be an improvement, and it won't reach the level of performance mentioned earlier in this post.

Compressing an array of 32-bit integers is not a "unique" scenario. But compressing an array of monotonically increasing 32-bit integers is not a good proxy for it. It features both excessive regularity and excessive difficulty, which are especially detrimental to LZ compression techniques. Compressing such a sample essentially tests whether this specific scenario has received special optimization attention. It won't be representative of compression of 32-bit integers "in general".

As you have discovered, compression of 64-bit integers is actually easier for zstd. Both its match finder and cost evaluation mechanisms work better in this scenario. I'm still impressed by the result you observed with the big-endian format, which feels intuitively "too good", but there might be a good reason which is just not properly understood yet.

> As usual, there are no shortcuts to benchmarking with real data šŸ˜

Indeed!

jfikar commented 2 years ago

I've just tried the little-endian, base-0 case: bzip2 -9 is not bad (12%), but zpaq -m3 ... -m5 is excellent, going down to 0.135%. I'm not sure about the speed, though; zpaq is usually very slow.