facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Poor compression of binary numeric data with surprising results #3014

Open RedBeard0531 opened 2 years ago

RedBeard0531 commented 2 years ago

Describe the bug

I wrote a simple test to produce a file with 1 million 4-byte incrementing ints and tried to compress it with various settings. I tried big and little endian, and I tried starting at both 0 and 0x12345678 to exercise all bytes. I'm sure there are many more interesting cases, but this seemed like a good starting point and already revealed a lot. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.

Interesting findings:

  1. Compression levels 1 through 17 behave roughly the same regardless of input, reducing the data to about 67% of its initial size.
  2. Something changes at level 18, and for most inputs it gives significantly better results.
  3. --format=xz -1 still compresses this data much better and faster than --format=zstd -18.
  4. Zstd is very sensitive to endianness and base value, while xz barely changes. (I wrote this before noticing that I wasn't passing -1 when testing xz as I intended to. When I added it and reran the tests, only one result really changed size, and it got smaller...)

Results:

| Endian | Base | zstd -18 | xz -1 | Notes |
|--------|------|----------|-------|-------|
| Big | 0 | 10.46% in 0.791s | 6.47% in 0.272s | Best result for zstd |
| Big | 0x12345678 | 26.84% in 0.578s | 6.38% in 0.263s | |
| Little | 0 | 67.21% in 0.532s | 2.52% in 0.243s | Worst result for zstd; -18 gives the same compression as -1. Best result for xz, although its default level compresses to 6.30% |
| Little | 0x12345678 | 25.43% in 0.568s | 6.37% in 0.257s | With a non-zero base, endianness doesn't matter |

To Reproduce

I used this Python file to generate the test data, toggling the commented-out lines.

```python
from array import array

BASE = 0
#BASE = 0x12345678 # uncomment to test non-zero high bytes

a = array('i')
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap() # uncomment to try big endian

with open('test.bin', 'wb') as f:
    a.tofile(f)
```

I ran `python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null` to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.
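For anyone who wants to sweep all four variants in one go, a rough driver along these lines works. This is only a sketch (not part of the original report); it assumes the zstd CLI is on the PATH and was built with lzma support so that --format=xz is available, and the file names are arbitrary:

```python
# Hypothetical benchmark driver: regenerate each endian/base variant and record
# the zstd -18 and xz -1 ratios and times. Assumes the `zstd` CLI is installed.
import os
import subprocess
import time
from array import array

def write_sample(path, base, big_endian):
    a = array('i', (base + i for i in range(1000000)))  # 1M incrementing 4-byte ints
    if big_endian:
        a.byteswap()
    with open(path, 'wb') as f:
        a.tofile(f)

def compress(args, path):
    start = time.perf_counter()
    # -q: quiet, -f: overwrite the previous output file if present
    subprocess.run(['zstd', '-q', '-f', *args, path, '-o', path + '.cmp'],
                   check=True)
    elapsed = time.perf_counter() - start
    ratio = os.path.getsize(path + '.cmp') / os.path.getsize(path)
    return ratio, elapsed

for big_endian in (False, True):
    for base in (0, 0x12345678):
        write_sample('test.bin', base, big_endian)
        for label, args in (('zstd -18', ['--format=zstd', '-18']),
                            ('xz -1', ['--format=xz', '-1'])):
            ratio, secs = compress(args, 'test.bin')
            endian = 'big' if big_endian else 'little'
            print(f'{endian:6} base={base:#x} {label}: {ratio:.2%} in {secs:.3f}s')
```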

Expected behavior

I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:

  1. The sensitivity to endianness is very surprising to me. It might indicate either a bug or low-hanging fruit for improvement.
  2. The stark change going from level 17 to level 18 for very regular data like this implies that some feature is enabled at 18 that isn't enabled earlier. Maybe it should be? And if not, it would be nice to know what it is, so that use cases where this pattern is likely can opt in to it separately without also paying for larger window sizes and the like.
  3. I was actually more surprised by how little xz was affected by the choice of base value at its default level, but the extent to which it affected zstd seemed excessive.
  4. It seemed interesting that the best case for xz -1 was also the worst case for zstd -18.


Cyan4973 commented 2 years ago
RedBeard0531 commented 2 years ago

Thanks for explaining the 3-byte cutoff for zstd. That led me to try expanding the data to 8-byte int64s (using array('q') and BASE = 0x1234567890abcdef in my script). I found that at the default level zstd makes the output significantly smaller than with int32s (1.00 MiB vs 2.56 MiB, with <1% difference based on base and endianness) even though the input is twice as large. That seems to track with finding 1 million matches vs finding none. For comparison, xz -1 is consistently around 250 KiB, while zstd -18 is 1 MiB for little endian and 80 KiB for big endian, regardless of base value.
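For reference, the 64-bit variant described above amounts to a small change to the generator script; here is a sketch of the modified version (assuming the same file name and element count as before, which were not restated in the thread):

```python
from array import array

BASE = 0x1234567890abcdef  # non-zero base so all 8 bytes are exercised

a = array('q')  # 'q' = signed 64-bit integers
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap()  # uncomment to try big endian

with open('test.bin', 'wb') as f:
    a.tofile(f)
```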

The endianness sensitivity (which is much higher for 8-byte data than for 4-byte data) is still very surprising to me. In either case, there is a repeating pattern of a 7-byte match and a 1-byte change. I can't imagine why it would matter what order they come in, since it is just a matter of which byte you start looking at. Unless there is something special about alignment, but I wouldn't expect textual data to be aligned, so I doubt it.
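To make that pattern concrete, here is a small illustrative snippet (added for clarity, not from the thread) that dumps a few consecutive 64-bit values in both byte orders: the big-endian dump shows seven unchanged bytes followed by the changing low byte, while the little-endian dump starts each value with the changing byte.

```python
import struct

BASE = 0x1234567890abcdef
for fmt, label in (('>q', 'big'), ('<q', 'little')):
    dump = b''.join(struct.pack(fmt, BASE + i) for i in range(4))
    # Group the hex output per 8-byte value to show which byte changes.
    # (bytes.hex() with a separator needs Python 3.8+.)
    print(f'{label:7}', dump.hex(' ', 8))
```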

> Consequently, as zstd leans more towards speed, its design does not employ these mechanisms.

Well, in this specific case zstd -18 is both slower and less effective at compression than xz -1. But based on what you said, it may just be a quirk of this specific input, so perhaps it isn't too important to optimize for.

> I presume your proposed objective here is to reach the best ratio (~10%) with zstd -18 on all variations of the sample.

My current objective with this investigation is actually somewhat detached from that. I work on a database, and we support zstd (among others) as a block compressor for our on-disk pages. For a few reasons, it is both difficult and potentially undesirable to do numeric-specific compression (e.g. delta encoding) on the format we use prior to feeding it to the block compressor. I was using this test to get a quick back-of-the-envelope estimate of how well zstd handles cases with a lot of similarish ints, to determine the cost-benefit tradeoff of working on numeric compression at the higher levels of our code. But it now looks like the simple test was way too simple to be useful. I'll need to try compressing actual pages (separately rather than in bulk) using our actual encoding (which isn't just an array of fixed-size ints). As usual, there are no shortcuts to benchmarking with real data šŸ˜
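As a rough illustration of the numeric-specific preprocessing mentioned above, a delta-encoding pass in front of the block compressor could look like the sketch below. This is purely hypothetical (not the database's actual page format); for an incrementing sequence it turns the payload into one leading value plus a long run of identical small deltas, which any LZ/entropy stage handles easily.

```python
from array import array

def delta_encode(values):
    """Keep the first value, then store successive differences."""
    out = array('q', [values[0]])
    prev = values[0]
    for cur in values[1:]:
        out.append(cur - prev)
        prev = cur
    return out

original = array('q', (0x1234567890abcdef + i for i in range(1000000)))
deltas = delta_encode(original)  # first element, then 999,999 ones
# deltas.tobytes() would be fed to the block compressor instead of original.tobytes()
```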

My broader objective in filing the ticket is to see whether this is indicative of some general case where zstd compression can be improved. For example, while I'm using binary ints in my test, I suspect it has properties in common with large amounts of similarish ASCII-decimal numbers (e.g. physical sensor readings over time) encoded in CSV or JSON, which seems like a case zstd would care about. However, if you think the observed results are just a quirk of this specific synthetic data set and wouldn't apply to real-world use cases that you are targeting, then feel free to close the ticket. I can create a new ticket if I find anything interesting or problematic when I try compressing with our actual page encoding.

PS- since I assume you've spent a lot of time researching many block compression implementations, do you happen to know of any performance-oriented ones that compress well with low-entropy, compact numeric data? Thanks!

Cyan4973 commented 2 years ago

inc1M_6k.zst.gz

This file manages to compress the first 1,000,000 integers in 32-bit little-endian format into 6242 bytes of zstd format (recompressed into .gz after that, due to a GitHub limitation). That is a compression ratio of 0.15%, or equivalently a compression multiplier of x640. There are a few tricks remaining that could compress it even more, but I don't see the point in pushing further. It shows that the format can handle this specific sample efficiently.

To achieve this result, I had to guide the search process and take control of block boundaries, so this is not something the "regular" compressor achieves. Flexible block boundaries are an area of interest where we expect to make some progress in the future, though this specific scenario benefits unusually much from it; real-world data won't see such large gains.

A real achievement would be to reach the same result automatically, using only the existing match finder and cost evaluation mechanisms. That, unfortunately, is unlikely. As a reference point, zstd -18 achieves ~25% on this same sample (note: for some reason, this is nonetheless noticeably better than the 67% mentioned in your measurements).

The compressed file had to be compressed again with gzip, due to a GitHub limitation. Incidentally, that's a good opportunity to note that the compressed file is itself highly compressible. This is because of the extremely regular nature of the data, which produces regularity in the compressed stream itself. Such an effect is not expected in the presence of "normal" data.

An array of unique 32-bit numbers is kind of the worst case for zstd. In this scenario, there are only 3-byte matches to be found. And while higher compression levels do feature a 3-byte match finder, it's pretty weak and not designed for thorough search.

It might be possible to improve it a bit, especially given that the compression results are so different depending on endianness and start value, which suggests some shortcoming. It's unclear at this stage whether that can be addressed without breaking other things. But even if it can, I wouldn't hold my breath: it will "just" be an improvement, and it won't reach the level of performance mentioned earlier in this post.

Compressing an array of 32-bit integers is not a "unique" scenario. But compressing an array of monotonically increasing 32-bit integers is not a good proxy for it. It features both excessive regularity and excessive difficulty, which are especially detrimental to LZ compression techniques. Compressing such a sample essentially tests whether this specific scenario has received special optimization attention. It won't be representative of compression of 32-bit integers "in general".

As you have discovered, compression of 64-bit integers is actually easier for zstd. Both its match finder and cost evaluation mechanisms work better in this scenario. I'm still impressed by the result you observed with the big-endian format, which feels intuitively "too good", but there might be a good reason which is just not properly understood yet.

> As usual, there are no shortcuts to benchmarking with real data šŸ˜

Indeed!

jfikar commented 2 years ago

I've just tried the little-endian, base-0 case: bzip2 -9 is not bad (12%), but zpaq -m3 ... -m5 is excellent, going down to 0.135%. I'm not sure about the speed, though; zpaq is usually very slow.