RedBeard0531 opened this issue 2 years ago
--zstd=minMatch=3 command
xz has additional mechanisms which are especially helpful for compression ratio in this specific case. Consequently, as zstd leans more towards speed, its design does not employ these mechanisms. It may be able to compensate partially, but on this specific sample set, it will always be at a disadvantage.
Thanks for explaining the 3-byte cutoff for zstd. That led me to try expanding the data to 8-byte int64's (using array('q') and BASE=0x1234567890abcdef in my script). I found that at the default level zstd makes the output significantly smaller than with int32's (1.00MiB vs 2.56MiB, with <1% difference based on base and endianness) even though the input is twice as large. That seems to track with finding 1 million matches vs finding none. For comparison, xz -1 is consistently around 250KiB, while zstd -18 is 1MiB for little endian and 80KiB for big endian, regardless of base value.
The endianness sensitivity (which is much higher for 8-byte data than for 4-byte data) is still very surprising to me. In either case, there is a repeating pattern of a 7-byte match and a 1-byte change. I can't imagine why it would matter what order they come in, since it is just a matter of which byte you start looking at. Unless there is something special about alignment, but I wouldn't expect textual data to be aligned, so I doubt it.
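A quick way to eyeball the two layouts (a throwaway Python sketch, reusing the 64-bit BASE value from the script):

```python
# Dump the first few 8-byte integers in both byte orders to see where the
# changing byte sits relative to the repeating 7-byte run.
import struct

BASE = 0x1234567890ABCDEF  # same base as the 8-byte variant above

for label, fmt in (("little", "<Q"), ("big", ">Q")):
    print(label)
    for i in range(4):
        print("  " + " ".join(f"{b:02x}" for b in struct.pack(fmt, BASE + i)))
```

In little endian the changing byte leads each record and the repeated 7 bytes trail it; in big endian it is the other way around. Either way, every record offers a 7-byte match candidate, which is what makes the compressed-size gap so puzzling.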
Consequently, as zstd leans more towards speed, its design does not employ these mechanisms.
Well, in this specific case zstd -18 is both slower and compresses worse than xz -1. But based on what you said, it may just be a quirk of this specific input source, so perhaps it isn't too important to optimize for.
I presume your proposed objective here is to reach the best ratio (~10%) with zstd -18 on all variations of the sample.
My current objective with this investigation is actually somewhat detached from this. I work on a database and we support zstd (among others) as a block compressor for our on-disk pages. For a few reasons, it is both difficult and potentially undesirable to do numeric-specific compression (e.g. delta-encoding) on the format we use prior to feeding it to the block compressor. I was using this test to get a quick "back of the envelope" estimate for how well zstd will be able to handle cases with a lot of similarish ints, to determine the cost-benefit tradeoff of working on numeric compression at the higher levels of our code. But it now looks like the simple test was way too simple to be useful. I'll need to try compressing actual pages (separately rather than in bulk) and using our actual encoding (which isn't just an array of fixed-size ints). As usual, there are no shortcuts to benchmarking with real data.
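For concreteness, the kind of numeric pre-pass being weighed here is roughly the following; this is a minimal sketch assuming a flat array of int64's and the third-party zstandard Python bindings, not our actual page format:

```python
# Delta-encode a block of 8-byte ints before handing it to zstd, and compare
# against compressing the raw bytes directly.
from array import array
import zstandard  # third-party package, used here only for illustration

values = array('q', range(0x1234567890ABCDEF, 0x1234567890ABCDEF + 1_000_000))

deltas = array('q', [values[0]])
deltas.extend(values[i] - values[i - 1] for i in range(1, len(values)))

cctx = zstandard.ZstdCompressor(level=18)
print("raw  :", len(cctx.compress(values.tobytes())))
print("delta:", len(cctx.compress(deltas.tobytes())))
# On this synthetic input every delta is 1, so the delta block collapses to
# almost nothing; real pages would land somewhere in between.
```

Whether that extra pass pays for itself on real pages is exactly the benchmarking question.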
My broader objective with filing the ticket is to see if this is indicative of some general case where zstd compression can be improved. For example, while I'm using binary ints in my test, I suspect that it has properties in common with large amounts of similarish ascii-decimal numbers (e.g. physical sensor readings over time) encoded in csv or json, which seems like a case zstd would care about. However, if you think the observed results are just a quirk of the specific synthetic data set and wouldn't apply to real world use cases that you are targeting, then feel free to close the ticket. I can create a new ticket if I find anything interesting or problematic when I try compressing with our actual page encoding.
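As a made-up example of that kind of input (again leaning on the zstandard package purely for illustration):

```python
# Build a CSV of timestamps plus slowly drifting ascii-decimal readings and
# compress it, mimicking the "sensor readings over time" shape described above.
import io
import zstandard

buf = io.StringIO()
buf.write("timestamp,reading\n")
for i in range(100_000):
    buf.write(f"{1_650_000_000 + i},{20.0 + (i % 500) * 0.01:.2f}\n")

csv_bytes = buf.getvalue().encode()
print("raw :", len(csv_bytes))
print("zstd:", len(zstandard.ZstdCompressor(level=3).compress(csv_bytes)))
```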
PS- since I assume you've spent a lot of time researching many block compression implementations, do you happen to know of any performance-oriented ones that compress well with low-entropy, compact numeric data? Thanks!
This file manages to compress the first 1,000,000 integers in 32-bit little-endian format into 6242 bytes of zstd format (recompressed into .gz after that, due to a GitHub limitation). This is a compression ratio of 0.15%, or equivalently a compression multiplier of x640. There are a few tricks remaining that could compress it even more, but I don't see the point in pushing it further. It shows that the format could handle this specific sample efficiently.
To achieve this result, I had to guide the search process and take control of block boundaries, so this is not achieved with the "regular" compressor. Flexible block boundaries are an area of interest where we expect to make some progress in the future, though this specific scenario benefits unusually strongly from it, and real-world data won't see such large gains.
A real achievement would be to reach the same result automatically, using only the existing match finder and cost evaluation mechanisms. That, unfortunately, is unlikely. As a reference point, zstd -18 achieves ~25% on this same sample.
note: for some reason, this result is nonetheless noticeably better than the 67% mentioned in your measurements (?).
The compressed file had to be compressed again with gzip, due to a GitHub limitation. Incidentally, that's a good opportunity to note that the compressed file is itself highly compressible. This is because of the extremely regular nature of the data, which produces regularity in the compressed stream itself. Such an effect is not expected in the presence of "normal" data.
An array of unique 32-bit numbers is kind of the worst case for zstd.
In this scenario, there are only 3-byte matches to be found.
And while higher compression levels do feature a 3-byte match finder, it's pretty weak, and not designed for thorough search.
It might be possible to improve it a bit, especially given that the compression results differ so much depending on endianness and start values, which suggests some shortcoming. It's unclear at this stage whether these shortcomings can be addressed without breaking other things. But even if they can, I wouldn't hold my breath: it will "just" be an improvement, and won't reach the level of performance mentioned earlier in this post.
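For illustration, dumping a few consecutive 32-bit little-endian values (a quick sketch using the same 0x1234567 start value as the test) shows why nothing longer than a 3-byte run ever repeats:

```python
# Hex-dump four consecutive 32-bit little-endian integers: between records only
# the 3-byte run "45 23 01" repeats, each time preceded by a different byte.
import struct

BASE = 0x1234567
print(" ".join(f"{b:02x}" for i in range(4) for b in struct.pack("<I", BASE + i)))
# -> 67 45 23 01 68 45 23 01 69 45 23 01 6a 45 23 01
```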
Compressing an array of 32-bit integers is not a "unique" scenario. But compressing an array of monotonically increasing 32-bit integers is not a good proxy for it: it features both excessive regularities and excessive difficulties, which are especially detrimental to LZ compression techniques. Compressing such a sample essentially tests whether this specific scenario has received special optimization attention; it won't be representative of compression of 32-bit integers "in general".
As you have discovered, compression of 64-bit integers is actually easier for zstd. Both its match finder and cost evaluation mechanisms work better in this scenario. I'm still impressed by the result you observed with the big-endian format, which feels intuitively "too good", but there might be a good reason which is just not properly understood yet.
As usual, there are no shortcuts to benchmarking with real data.
Indeed!
I've just tried the little-endian sample with base 0: bzip2 -9 is not bad (12%), but zpaq -m3 ... -m5 is excellent, going down to 0.135%. I'm not sure about the speed though; zpaq is usually very slow.
Describe the bug
I wrote a simple test to produce a file with 1 million 4-byte incrementing ints, and tried to compress with various settings. I tried big and little endian, and I tried starting at both 0 and 0x1234567 to fill all bytes. I'm sure there are a lot more interesting cases, but this seemed like a good starting point and revealed a lot already. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.
Interesting findings:
- --format=xz -1 still compresses this data much better and faster than --format=zstd -18.
- (I initially forgot the -1 when testing xz as I intended to. When I added it and reran the testing, only one result really changed size, and it got smaller...)

Results:
Best result for xz, although the default level compresses to 6.30%.
To Reproduce
I used this python file to generate the test data, toggling the commented-out lines.
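A minimal sketch of such a generator, assuming the parameters mentioned in this thread (typecodes 'i'/'q', the BASE values 0, 0x1234567 and 0x1234567890abcdef, and a byte-order toggle):

```python
# Write N incrementing fixed-size ints to test.bin; toggle width, base value,
# and byte order via the commented-out lines.
from array import array

N = 1_000_000

BASE = 0
# BASE = 0x1234567            # fills all four bytes of each 32-bit value
# BASE = 0x1234567890ABCDEF   # 64-bit variant; switch to typecode 'q' below

data = array('i', range(BASE, BASE + N))    # 'i' = 4-byte signed ints on common platforms
# data = array('q', range(BASE, BASE + N))  # 'q' = 8-byte signed ints

# data.byteswap()  # uncomment to emit the non-native byte order

with open('test.bin', 'wb') as f:
    data.tofile(f)
```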
I ran
python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null
to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.

Expected behavior
I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:
xz -1 was also the worst case for zstd -18.

Screenshots and charts
None
Desktop (please complete the following information):
*** zstd command line interface 64-bits v1.5.1, by Yann Collet ***
Additional context