LLNL / zfp

Compressed numerical arrays that support high-speed random access
http://zfp.llnl.gov
BSD 3-Clause "New" or "Revised" License

30x lower compression ratio than zip #11

Closed. snarb closed this issue 6 years ago

snarb commented 6 years ago

I am using tolerance = 0.0 to compress 2D 1000 x 1000 float arrays (many different arrays, so this is not an isolated case). If, for example, an array contains only zeros or a single repeated value, the compression ratio is high, but when there are several repeated values and some random ones, the compression ratio is low, something like 30%. At the same time, the zip library compresses the same arrays about 30x better on average.
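For reference, a minimal sketch of the call sequence being described, assuming the zfp C API as shown in its documentation (zfp_field_2d, zfp_stream_set_accuracy, zfp_compress); the array contents and dimensions are placeholders rather than the reporter's data:

```c
/* Sketch: compress a 2D float array with zfp in fixed-accuracy mode
   at tolerance 0.0 (error handling kept minimal for brevity). */
#include <stdio.h>
#include <stdlib.h>
#include "zfp.h"

int main(void)
{
  const unsigned int nx = 1000, ny = 1000;
  float* data = calloc((size_t)nx * ny, sizeof(float));  /* input array */

  zfp_field* field = zfp_field_2d(data, zfp_type_float, nx, ny);
  zfp_stream* zfp = zfp_stream_open(NULL);
  zfp_stream_set_accuracy(zfp, 0.0);                      /* tolerance = 0.0 */

  size_t bufsize = zfp_stream_maximum_size(zfp, field);   /* worst-case size */
  void* buffer = malloc(bufsize);
  bitstream* stream = stream_open(buffer, bufsize);
  zfp_stream_set_bit_stream(zfp, stream);
  zfp_stream_rewind(zfp);

  size_t zfpsize = zfp_compress(zfp, field);               /* 0 on failure */
  printf("compressed %zu bytes to %zu bytes\n",
         (size_t)nx * ny * sizeof(float), zfpsize);

  zfp_field_free(field);
  zfp_stream_close(zfp);
  stream_close(stream);
  free(buffer);
  free(data);
  return 0;
}
```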

lindstro commented 6 years ago

zfp, like all compressors, makes some assumptions on how data values are distributed. It compresses well for distributions it is designed for, but--like all compressors--will do worse and even expand some inputs that violate the assumptions on the data. By the pigeonhole principle, every (lossless) compressor must expand some inputs in order to compress others.

zfp assumes that values vary somewhat smoothly in each dimension and that the least significant floating-point bits are virtually random. This tends to be the case for scientific data, such as fields generated in the solution of partial differential equations. In fact, zfp assumes that the data represents a function that is regularly sampled and can be well approximated using a low-order Taylor expansion. From the description of your data, it sounds like it does not represent a smooth function. Instead, it sounds like it has patterns that the zip compressor can exploit, but which tend to be rare in scientific floating-point data.
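As an illustration of that assumption (not the reporter's data), the first field below is the kind of regularly sampled, smoothly varying input zfp targets, while the second has no correlation between neighboring values and would be expected to compress poorly:

```c
/* Illustration only: a smooth 2D field versus one with independent values. */
#include <math.h>
#include <stdlib.h>

#define NX 1000
#define NY 1000

/* Regularly sampled smooth function: neighboring values are close. */
static void fill_smooth(float* f)
{
  for (size_t y = 0; y < NY; y++)
    for (size_t x = 0; x < NX; x++)
      f[y * NX + x] = sinf(0.01f * x) * cosf(0.01f * y);
}

/* Independently drawn values: no smoothness for zfp to exploit. */
static void fill_rough(float* f)
{
  for (size_t i = 0; i < (size_t)NX * NY; i++)
    f[i] = 2.0f * (float)rand() / RAND_MAX - 1.0f;
}
```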

zfp was furthermore designed for lossy compression. By setting the tolerance to zero, I am guessing you are interested in lossless compression. For that, I would suggest trying our fpzip compressor instead.
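For what it's worth, a minimal sketch of lossless compression with fpzip's buffer interface, assuming its documented C API (the FPZ fields nx/ny/nz/nf and prec, with prec = 0 requesting full precision, i.e., lossless):

```c
/* Sketch: lossless fpzip compression of a 2D float array
   (error handling omitted for brevity). */
#include <stddef.h>
#include "fpzip.h"

size_t compress_lossless(const float* data, int nx, int ny,
                         void* buffer, size_t bufsize)
{
  FPZ* fpz = fpzip_write_to_buffer(buffer, bufsize);
  fpz->type = FPZIP_TYPE_FLOAT;
  fpz->prec = 0;            /* 0 = full precision (lossless) */
  fpz->nx = nx;
  fpz->ny = ny;
  fpz->nz = 1;
  fpz->nf = 1;
  fpzip_write_header(fpz);
  size_t outbytes = fpzip_write(fpz, data);  /* 0 indicates failure */
  fpzip_write_close(fpz);
  return outbytes;
}
```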

You may also consult the zfp troubleshooting guide, which discusses reasons why the data may not compress well: http://zfp.readthedocs.io/en/release0.5.2/issues.html. Lastly, I would be happy to take a closer look at your data to see if there are other reasons why it might not compress well.

snarb commented 6 years ago

Thanks, Peter, I got it. Indeed, the data function is not smooth. Those arrays are actually equities from poker (how much more profitable one hand is than another). If you are interested in the data, here is an example 1326x1326 float array in binary format:
https://drive.google.com/open?id=1m9GNLjJ6VFJTwVMK4o7gh2ItFN7OAxx5 (values are between -1 and 1). The data is zero-sum (antisymmetric): ar(x, y) = -ar(y, x). I have also converted it to a 1D array by removing this symmetry; it doesn't help much with compression. Lossy compression is also an interesting topic for me in some cases, but again the data function is not smooth, so zipping looks like a better choice for such a function with patterns.
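A sketch of the symmetry removal mentioned above: since ar(x, y) = -ar(y, x), the diagonal is zero and the lower triangle is just the negation of the upper, so only the strict upper triangle needs to be stored (for n = 1326 that is n*(n-1)/2 = 878475 of the 1326*1326 values):

```c
/* Sketch: pack the strict upper triangle of an antisymmetric n x n matrix. */
#include <stdlib.h>

float* upper_triangle(const float* ar, size_t n, size_t* out_count)
{
  size_t count = n * (n - 1) / 2;
  float* packed = malloc(count * sizeof(float));
  size_t k = 0;
  for (size_t y = 0; y < n; y++)
    for (size_t x = y + 1; x < n; x++)   /* strict upper triangle only */
      packed[k++] = ar[y * n + x];
  *out_count = count;
  return packed;
}
```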

lindstro commented 6 years ago

I looked at this file, and it contains only three different values: {-1, 0, +1}. No wonder zip does so well on this file! :-)

This type of floating-point data is not what zfp was designed for. In fact, if the arrays contain only a few discrete floating-point values, then I would suggest storing a small lookup table, replacing the floats with 8-bit indices into that table (if sufficient), and applying a conventional byte-based compressor, such as gzip, bzip2, etc. On your example data set, bzip2 is able to reduce it to 47935 bytes, or about 3 times smaller than the zip file.
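A sketch of that suggestion, using zlib's compress() as the byte-based compressor (gzip or bzip2 applied to the index stream works the same way; the linear-scan table construction is just one simple option):

```c
/* Sketch: map floats drawn from a small set of distinct values to 8-bit
   indices into a lookup table, then compress the index stream with zlib. */
#include <stdlib.h>
#include <zlib.h>

/* Build the value table and index stream.
   Returns the number of distinct values, or 0 if there are more than 256. */
size_t build_indices(const float* data, size_t n,
                     float table[256], unsigned char* indices)
{
  size_t nvalues = 0;
  for (size_t i = 0; i < n; i++) {
    size_t j = 0;
    while (j < nvalues && table[j] != data[i])
      j++;
    if (j == nvalues) {
      if (nvalues == 256)
        return 0;                        /* too many distinct values */
      table[nvalues++] = data[i];
    }
    indices[i] = (unsigned char)j;
  }
  return nvalues;
}

/* Compress the index stream; returns the compressed size, or 0 on error. */
size_t compress_indices(const unsigned char* indices, size_t n,
                        unsigned char** out)
{
  uLongf destlen = compressBound(n);
  *out = malloc(destlen);
  if (compress(*out, &destlen, indices, n) != Z_OK)
    return 0;
  return destlen;
}
```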