revintec opened 11 months ago
This can happen, though it is not the expectation.
All these algorithms make bets, even the strongest ones. They bet on a certain distribution of statistics, a certain probability that events will occur in the future, and those bets may turn out to be wrong.
We already know that these bets can be weaponized to make them fail. This is more likely to happen in the presence of synthetic data, which tends to be unrepresentative of real-world scenarios. But of course, from time to time, real scenarios can be affected too.
The solution is far from trivial. It generally requires access to the source data, in order to understand its behavior and how it's analyzed by the algorithm. Then, sometimes, a heuristic can be tweaked to better handle the corner case. Other times, it can't be, or the fix comes at the expense of another corner case.
Long term, we could imagine an even stronger (and slower) compression parser that would make fewer bets about the future, because it would have access to more information to evaluate trade-offs. Sadly, it hasn't happened yet: it's a very time-consuming task, and we don't have real production workloads that would benefit from the effort. Also, the gains are expected to be small, i.e. < 1% for general cases, excluding pathological corner cases. So this effort tends to be repeatedly postponed. It's still in the cards, and I still hope it will happen someday.
For the time being, our best option is to have a look at the source data, if it's shareable, and try to learn from it.
> ver & zstd -V & zstd -b1e19 | sort /+2
Microsoft Windows [Version 6.1.7601]
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
16#Synthetic 50% : 10000000 -> 3080678 (x3.246), 6.85 MB/s, 1769.4 MB/s
2#Synthetic 50% : 10000000 -> 3129137 (x3.196), 443.6 MB/s, 1810.9 MB/s
17#Synthetic 50% : 10000000 -> 3136878 (x3.188), 2.92 MB/s, 1537.9 MB/s
19#Synthetic 50% : 10000000 -> 3140424 (x3.184), 2.21 MB/s, 1480.5 MB/s
18#Synthetic 50% : 10000000 -> 3145664 (x3.179), 2.67 MB/s, 1463.4 MB/s
1#Synthetic 50% : 10000000 -> 3154228 (x3.170), 565.7 MB/s, 1854.6 MB/s
-- nonsense in bench --
3#Synthetic 50% : 10000000 -> 3230847 (x3.095), 238.1 MB/s, 1461.9 MB/s
6#Synthetic 50% : 10000000 -> 3280418 (x3.048), 95.7 MB/s, 1200.6 MB/s
5#Synthetic 50% : 10000000 -> 3292183 (x3.037), 108.0 MB/s, 1178.9 MB/s
8#Synthetic 50% : 10000000 -> 3318556 (x3.013), 75.3 MB/s, 1001.4 MB/s
7#Synthetic 50% : 10000000 -> 3327348 (x3.005), 85.8 MB/s, 980.3 MB/s
4#Synthetic 50% : 10000000 -> 3345685 (x2.989), 186.1 MB/s, 1277.5 MB/s
15#Synthetic 50% : 10000000 -> 3353801 (x2.982), 8.00 MB/s, 1068.8 MB/s
14#Synthetic 50% : 10000000 -> 3354678 (x2.981), 9.75 MB/s, 749.2 MB/s
13#Synthetic 50% : 10000000 -> 3354692 (x2.981), 9.28 MB/s, 1068.3 MB/s
9#Synthetic 50% : 10000000 -> 3355994 (x2.980), 54.2 MB/s, 834.1 MB/s
12#Synthetic 50% : 10000000 -> 3362882 (x2.974), 24.1 MB/s, 608.8 MB/s
10#Synthetic 50% : 10000000 -> 3363166 (x2.973), 35.3 MB/s, 1030.5 MB/s
11#Synthetic 50% : 10000000 -> 3363170 (x2.973), 25.8 MB/s, 1045.0 MB/s
Current situation regarding this specific scenario:
1#issue3793.bin : 262144 -> 204103 (x1.284),
2#issue3793.bin : 262144 -> 204118 (x1.284),
3#issue3793.bin : 262144 -> 204162 (x1.284),
4#issue3793.bin : 262144 -> 204136 (x1.284),
5#issue3793.bin : 262144 -> 204147 (x1.284),
6#issue3793.bin : 262144 -> 204141 (x1.284),
7#issue3793.bin : 262144 -> 204161 (x1.284),
8#issue3793.bin : 262144 -> 204161 (x1.284),
9#issue3793.bin : 262144 -> 204161 (x1.284),
10#issue3793.bin : 262144 -> 204161 (x1.284),
11#issue3793.bin : 262144 -> 204165 (x1.284),
12#issue3793.bin : 262144 -> 204161 (x1.284),
13#issue3793.bin : 262144 -> 204143 (x1.284),
14#issue3793.bin : 262144 -> 83240 (x3.149),
15#issue3793.bin : 262144 -> 83240 (x3.149),
16#issue3793.bin : 262144 -> 83242 (x3.149),
17#issue3793.bin : 262144 -> 83242 (x3.149),
18#issue3793.bin : 262144 -> 82866 (x3.163),
19#issue3793.bin : 262144 -> 82866 (x3.163),
20#issue3793.bin : 262144 -> 82866 (x3.163),
21#issue3793.bin : 262144 -> 82866 (x3.163),
22#issue3793.bin : 262144 -> 82866 (x3.163),
Compression ratio could be even better at higher compression levels, but at least performance no longer degrades.
I'll provide some context here: this kind of extremely regular data may actually be quite "real" and "common" in practice; think of index files. I suspect it's currently rare (in issue reports and/or testing corpora) because current compression algorithms can't handle it well, so users give up and hand-write some delta compression themselves (it's easy to write and solves the problem quickly).
I've seen this on other occasions too, such as Elasticsearch's index files. These are also commonly pre-compressed using a delta algorithm ([100,130,160,190,200,210,240...] -> [100,30,30,30,10,10,30...]), making them easier for compression algorithms to handle.
The original data I referenced comes from tsdb timestamps, which increment by 30 (seconds) for every sample, with occasional larger increments like 3600 (1 hour) and/or 86400 (1 day). The original code used hand-written delta compression (with varints and other tricks) but no zstd. I tried to replace the hand-written algorithm with zstd, to see if I could make the code more general and easier to maintain while still providing comparable compression benefits.
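For illustration, here is a minimal sketch of the kind of hand-written "delta + varint" scheme described above (function names are hypothetical, not from any real tsdb codebase): regular timestamps turn into small deltas, and a varint encoding stores each small delta in a single byte.

```python
# Sketch of a delta + varint codec (illustrative names, not real tsdb code):
# mostly-regular timestamps become tiny deltas, stored as zigzag varints.

def zigzag(n: int) -> int:
    """Map signed ints to unsigned: 0,-1,1,-2,... -> 0,1,2,3,..."""
    return (n << 1) if n >= 0 else -(n << 1) - 1

def unzigzag(u: int) -> int:
    return (u >> 1) if u % 2 == 0 else -((u + 1) >> 1)

def encode_varint(n: int, out: bytearray) -> None:
    """LEB128-style unsigned varint: 7 data bits per byte, high bit = continue."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)

def delta_varint_encode(values: list[int]) -> bytes:
    out = bytearray()
    prev = 0
    for v in values:
        encode_varint(zigzag(v - prev), out)
        prev = v
    return bytes(out)

def delta_varint_decode(buf: bytes) -> list[int]:
    values, prev, i = [], 0, 0
    while i < len(buf):
        u = shift = 0
        while True:
            b = buf[i]; i += 1
            u |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        prev += unzigzag(u)
        values.append(prev)
    return values

# The sample sequence from above: mostly +30 steps, a couple of +10 steps.
ts = [100, 130, 160, 190, 200, 210, 240]
enc = delta_varint_encode(ts)
assert delta_varint_decode(enc) == ts
assert len(enc) == 8  # 2 bytes for the first value, then 1 byte per delta
```

As the last assertion shows, seven 4-byte timestamps shrink to 8 bytes, which is roughly the density the hand-written scheme achieves on this kind of data.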
The unexpected part is not (only) the compression ratio per se, but also that a higher compression level results in a noticeably larger size. I thought that a higher level would internally try all the heuristics of all lower levels.
- does Improve compression of Arrays of Integers (High compression mode) #3895 work on data other than int arrays, e.g. long arrays or double arrays, etc.?
Yes, but I believe the situation was not as bad for arrays of 64-bit values. Therefore, comparatively, arrays of 32-bit values receive a more substantial boost.
- is it possible for zstd to produce smaller files for these cases? e.g. by applying a new kind of the aforementioned pre-delta + varint heuristic at high compression levels
Not without breaking the format. Such preprocessing would be better provided by a higher layer.
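To illustrate what such a higher layer could look like, here is a hedged sketch: delta-transform the packed 32-bit little-endian integers before handing the bytes to a general-purpose compressor. zlib is used only as a stand-in because it ships with the Python standard library; the same transform can feed zstd.

```python
# Sketch of delta preprocessing done in a layer above the codec (assumption:
# the input is a packed array of little-endian 32-bit ints, as in this
# issue). zlib stands in for zstd purely for self-containment; the point is
# the transform, not the codec.
import struct
import zlib

def delta32_le(data: bytes) -> bytes:
    """Replace each 32-bit int with its difference from the previous one."""
    n = len(data) // 4
    ints = struct.unpack(f"<{n}i", data)
    deltas = (ints[0],) + tuple(b - a for a, b in zip(ints, ints[1:]))
    return struct.pack(f"<{n}i", *deltas)

# The issue's input: 65536 ints stepping by 30 (262144 bytes).
raw = struct.pack("<65536i", *range(0, 65536 * 30, 30))
plain = zlib.compress(raw, 9)
transformed = zlib.compress(delta32_le(raw), 9)
assert len(transformed) < len(plain)  # constant deltas compress far better
```

After the transform the payload is almost entirely the constant delta 30, which any general-purpose compressor handles trivially; the downstream codec never needs to know the bytes were integers.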
Using zstd 1.5.5 (latest version as of writing): prepare an int array (each int occupies 4 bytes, little endian)
[0,30,60,90,...]
of 65536 ints, i.e. 65536*4 bytes, then compress it using various compression levels (simple compression, no dict). As seen from the output above, a higher compression level (18) starts resulting in larger compressed data.
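The reproduction input can be generated with a few lines (the filename is illustrative; it simply matches the 262144-byte issue3793.bin sizes shown earlier):

```python
# Generate the reproduction input: 65536 little-endian 32-bit ints stepping
# by 30, i.e. 262144 bytes. Filename is illustrative.
import struct

data = struct.pack("<65536I", *range(0, 65536 * 30, 30))
assert len(data) == 262144
with open("issue3793.bin", "wb") as f:
    f.write(data)
# then benchmark across levels, e.g.: zstd -b1e19 issue3793.bin
```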
Searching the issues turned up no relevant information on the first page, and no relevant results on Google either :( Sorry if this has already been brought up. Also, there are some related questions that I'm putting into the same issue (forgive me :)