LLNL / zfp

Compressed numerical arrays that support high-speed random access
http://zfp.llnl.gov
BSD 3-Clause "New" or "Revised" License

The predicted maximum compressed size is much larger than the actual compressed size #200

qingyiyi closed this issue 1 year ago

qingyiyi commented 1 year ago

When I compress my original 47 MB of data in tolerance mode with the tolerance set to 0.001, the predicted maximum compressed size is 49 MB, which is larger than the original data, while the space actually used after compression is around 9 MB. I suspect this is because my data is unevenly distributed: most values are around 0.01, but some are greater than 1. If I use precision mode instead, the error for some values becomes too large. How should I choose the mode and set the parameters so that the predicted size is close to the actual size while the accuracy is still guaranteed?

lindstro commented 1 year ago

The issue here is that zfp has to conservatively estimate how much space is needed for the compressed data and must allow for the worst-case input. Since it knows nothing about the data that is to be compressed, it has to make this very pessimistic assumption. Due to the pigeonhole principle, any compressor must expand some inputs, and in this case that worst-case expansion leads to an estimate of 49 MB, slightly larger than the uncompressed input. Because your data is quite compressible, you're seeing a much smaller compressed size in practice, as is to be expected. One way to mitigate this overallocation is to use C malloc() and realloc() to (re)allocate the memory buffer and relinquish any unused memory.
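The allocate-then-shrink approach described above might look like the following minimal sketch. `shrink_to_fit` is a hypothetical helper and not part of zfp. The zfp calls appear only in a comment so that the block compiles without the library; the two sizes are assumed to come from `zfp_stream_maximum_size()` (worst case) and `zfp_compress()` (bytes actually written).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of zfp): shrink a heap buffer to the
 * number of bytes actually used. Returns the possibly moved buffer.
 * If realloc fails, the original buffer is still valid and is returned. */
static void* shrink_to_fit(void* buffer, size_t used)
{
  void* p;
  assert(buffer != NULL);
  if (used == 0)
    return buffer; /* skip realloc(ptr, 0), which is not portable */
  p = realloc(buffer, used);
  return p ? p : buffer;
}

/* Intended use with zfp, shown as a comment so this sketch builds
 * without the library:
 *
 *   size_t capacity = zfp_stream_maximum_size(zfp, field);  (worst case)
 *   void*  buffer   = malloc(capacity);
 *   ... open a bit stream over buffer and attach it to zfp ...
 *   size_t used     = zfp_compress(zfp, field);  (0 signals failure)
 *   ... close the bit stream before the buffer can move ...
 *   buffer = shrink_to_fit(buffer, used);
 */
```

Because `realloc()` is allowed to move the allocation, any bit stream opened over the old pointer should be closed before shrinking and reopened over the returned pointer if the data is to be read back.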

We're considering adding an option to zfp that in fixed-accuracy mode will quickly scan the data to gather some basic statistics. This would provide a much more accurate estimate but does require access to the data before it is compressed.

The only other suggestion I have is to combine fixed-accuracy mode with a user-set maximum rate via expert mode to cap the compressed size. This, however, would in general not allow the tolerance to be satisfied everywhere.
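As a rough illustration of that last suggestion, the sketch below derives the four expert-mode parameters for a given tolerance and a per-value rate cap. It rests on several assumptions that are not stated in this thread: that fixed-accuracy mode corresponds to `minexp = floor(log2(tolerance))`, that a block holds 4^d values in d dimensions so that `maxbits` is the rate times 4^d, and that the defaults 1 and 64 match `ZFP_MIN_BITS` and `ZFP_MAX_PREC` (real code should use the macros from `zfp.h`). The `expert_params` struct and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for zfp's four expert-mode parameters. */
typedef struct {
  unsigned minbits; /* minimum number of bits per block */
  unsigned maxbits; /* maximum number of bits per block (the size cap) */
  unsigned maxprec; /* maximum number of bit planes coded */
  int minexp;       /* smallest bit-plane exponent coded */
} expert_params;

/* Largest integer e with 2^e <= x, for x > 0 (avoids linking libm). */
static int floor_log2(double x)
{
  int e = 0;
  double p = 1.0; /* invariant: p == 2^e */
  assert(x > 0);
  while (p > x) { p *= 0.5; e--; }
  while (p * 2.0 <= x) { p *= 2.0; e++; }
  return e;
}

/* Fixed-accuracy parameters with an added cap of max_rate bits per value. */
static expert_params capped_accuracy_params(double tolerance, unsigned dims,
                                            unsigned max_rate)
{
  expert_params p;
  assert(tolerance > 0 && dims >= 1 && dims <= 4);
  p.minbits = 1;                      /* assumed equal to ZFP_MIN_BITS */
  p.maxbits = max_rate << (2 * dims); /* rate times 4^dims values per block */
  p.maxprec = 64;                     /* assumed equal to ZFP_MAX_PREC */
  p.minexp = floor_log2(tolerance);   /* assumed accuracy-mode exponent */
  return p;
}

/* Approximate payload bound in bytes; ignores header and word padding. */
static size_t max_payload_bytes(size_t blocks, unsigned maxbits)
{
  return (blocks * (size_t)maxbits + 7) / 8;
}
```

The resulting values would then be passed to `zfp_stream_set_params(zfp, p.minbits, p.maxbits, p.maxprec, p.minexp)`. Any block that needs more than `maxbits` bits to meet the tolerance is truncated at the cap, which is why the error bound is no longer guaranteed everywhere.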

qingyiyi commented 1 year ago

Thank you for your generous answer which solved my problem!