Cyan4973 / FiniteStateEntropy

New generation entropy codecs : Finite State Entropy and Huff0
BSD 2-Clause "Simplified" License
1.33k stars 143 forks source link

FSE_compress returns 0 even when maxDstSize is still equal to or slightly larger than final compression size #90

Open KrzysFR opened 6 years ago

KrzysFR commented 6 years ago

While adding more unit tests, I found a corner case when trying to compress a document with a maxDstSize which is exactly the expected compressed size (found by a previous compression attempt): in this test, FSE_compress(..) returns 0 (uncompressible data) while previously it was able to compress the same input (given a larger destination buffer).

I'm doing a two-step process: 1) Allocate a buffer large enough (using FSE_compressBound), and compress the source by calling FSE_compress(..., maxDstSize: FSE_compressBound(ORIGINAL_SIZE)), and measure its final compressed size COMPRESSED_SIZE. 2) Repeat the process on the same buffer and payload, but this time calling FSE_compress(..., maxDstSize: COMPRESSED_SIZE). I get a result code of 0 which means that data is uncompressible.

I tried probing the minimum size that will allow compressing the buffer (which is known to be compressible), and each time I need to call FSE_compress(..) with at least COMPRESSED_SIZE + 8. At first I thought it could be a pointer alignment issue, but it is always + 8 bytes whatever compressed size is (by changing the source a bit).

In my test, raw original size is 4,288 bytes, FSE_compressBound return 4,833 bytes, and the compressed size is 2,821 bytes. I need to pass at least 2,823+8 = 2,829 bytes for FSE_compress to succeed (return value > 0).

Is this expected behavior? I'm not sure if the "+8 rule" is true, or if this is random chance with the inputs I'm passing in.

Cyan4973 commented 6 years ago

This is an expected behavior, even if badly documented.

FSE_compress() writes data in "stripes" of size size_t, and will effectively need to write more than the amount of compressed data.

The amount of additional buffer space is not specified, but "+8" rule is about right. It's not documented because this kind of detail is implementation specific, and may change in the future.

Bulat-Ziganshin commented 6 years ago

Yeah, it's our internal machinery which is absolutely unintuitive for library users. I think that this issue may be considered as call for doc improvement. You can say that to ensure compression user should alloc exactly amount of memory computed by maxbound, and that compressor can write beyond size of final compressed data for the sake of speed, "up to 8 bytes in the x.x version, but it may be changed in future"

PS: forgot to say that it may be applicable to [your] other compression libraries as well

KrzysFR commented 6 years ago

I guess that this is not that big of a deal anyway: the chance to guess the compressed size within 8 bytes is low, and if the caller decides to under-allocate (compared to what FSE_compressBound returns) it comes with a high risk in the first place.

You'd only do this either because you are under tight memory constraints, or are dealing with data with very predictable compression ratio (but again, less than 8 bytes of headroom is tight!).

I'll add a warning in my code about that, and update the test to round up the allocated buffer.

Should I close this issue, or do you want to keep it around, if you ever decide to update the doc at a later time?