facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Possibly missing check for truncated initial states in Huffman weight block #4071

Closed: elasota closed this 4 days ago

elasota commented 3 weeks ago

Found another omission in the spec while trying to figure out whether it is possible to encode only one Huffman weight in FSE mode. It appears that the answer is "no", but Zstd's behavior in such a situation is not well-defined.

FSE_decompress_usingDTable_generic will load 2 initial state values from the input stream, but doesn't do any bitstream overflow check. With the current code paths, this means a fully or partially truncated state decodes as zero, even though the comments in BIT_lookBits state that the results of bit overreads are undefined (and the code path that is currently disabled by a preprocessor def in BIT_lookBits would produce a different result).

This doesn't necessarily trigger an error for a malformed Huffman tree either: if the initial states decode to 0 and 1 (in either order), they specify a valid Huffman tree.
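For illustration, here is a rough sketch of the spot I mean, with one possible once-per-block truncation check added after the two initial states are read. This is my own sketch against the internal bitstream API in lib/common/bitstream.h, paraphrased from memory rather than copied from fse_decompress.c, and it is not a proposed or verified fix:

```c
/* Sketch only: assumes zstd's internal headers (bitstream.h, fse.h,
 * error_private.h) and the surrounding context of
 * FSE_decompress_usingDTable_generic; paraphrased, not upstream code. */
BIT_DStream_t bitD;
FSE_DState_t state1;
FSE_DState_t state2;

{   size_t const initResult = BIT_initDStream(&bitD, cSrc, cSrcSize);
    if (FSE_isError(initResult)) return initResult;
}

FSE_initDState(&state1, &bitD, dt);   /* reads tableLog bits, no bounds check */
FSE_initDState(&state2, &bitD, dt);   /* reads tableLog more bits, no bounds check */

/* Possible check: if the two initial states consumed more bits than the
 * stream actually contains, reject the block instead of silently letting
 * the truncated states decode as zero. */
if (BIT_reloadDStream(&bitD) == BIT_DStream_overflow)
    return ERROR(corruption_detected);
```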

I'll try to get a sample that reproduces this behavior.

elasota commented 3 weeks ago

#4037 might actually be hitting the same issue.

Cyan4973 commented 3 weeks ago

Indeed, if there is only one Huffman weight, it means there are only 2 symbols. If there is only 1 weight to decode, that means there are no "zero" weights, so the symbols are necessarily 0 and 1, and necessarily use 1 bit each. This is a curious corner case, though I guess it could be made to happen by feeding zstd a random set of 0 and 1 values.

As you mentioned, it's likely encoded and decoded correctly, though it hits the fse library in a way where it's not obvious how to deal with it. The fse library has a special mode for when there is only one symbol with 100% probability, which is necessarily the case when there is only one symbol. It also has a mode to send data "raw" without compression when there is not enough data, which is also likely the case for a single symbol to encode. Finally, the header which describes the huffman tree is able to send weights directly, without employing fse.

It's been a long time, but if I was to guess, I suspect it's this last solution which is employed.
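For context on that last option: per the zstd format spec, the first byte of the Huffman tree description selects between the FSE-compressed and direct representations, roughly like this (my own illustration of the format, not code taken from zstd):

```c
#include <stddef.h>

/* Illustration only, following the zstd format spec's Huffman Tree
 * Description rules, not copied from zstd's decoder. */
typedef enum { WEIGHTS_FSE_COMPRESSED, WEIGHTS_DIRECT } WeightEncoding;

/* Returns which encoding the Huffman weights use and fills in either the
 * FSE-compressed payload size or the number of directly-coded weights. */
static WeightEncoding parseWeightHeader(unsigned char headerByte,
                                        size_t* fseCompressedSize,
                                        size_t* numDirectWeights)
{
    if (headerByte < 128) {
        /* Weights are FSE-compressed; headerByte is the compressed size. */
        *fseCompressedSize = headerByte;
        return WEIGHTS_FSE_COMPRESSED;
    }
    /* Direct representation: (headerByte - 127) weights, 4 bits each,
     * packed two per byte. */
    *numDirectWeights = (size_t)headerByte - 127;
    return WEIGHTS_DIRECT;
}
```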

elasota commented 3 weeks ago

The direct encoding will always be more efficient if there are only 2 values (it will consume 1 byte, whereas FSE-compressed weights will always consume at least 2 bytes), so I don't think the encoder should ever produce blocks of this sort from any data.
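A quick back-of-the-envelope version of that size argument, under my reading of the format spec (direct weights are 4-bit nibbles, two per byte):

```c
#include <stddef.h>

/* Illustration only: payload bytes needed to store Huffman weights in the
 * direct representation (4 bits per weight, packed two per byte). */
static size_t directWeightPayloadBytes(size_t numWeights)
{
    return (numWeights + 1) / 2;
}

/* With only 1 or 2 weights this is a single byte, while an FSE-compressed
 * weight stream takes at least 2 bytes (as noted above), so a sane encoder
 * always picks the direct representation in this corner case. */
```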

It'd probably only happen with a specially constructed block.

But this seems like it should be an error... and it only needs to be checked once per block at most, so I assume the overhead would be small.

elasota commented 2 weeks ago

Here are some test blocks; the second one is probably the more useful one:

28 b5 2f fd 00 00 55 00 00 72 80 01 04 00 7e ff 03 aa 00
Decompresses to: 01 02 01 02 01 02 01

28 b5 2f fd 00 00 55 00 00 72 80 01 04 20 7e 1f 02 aa 00
Decompresses to: 00 02 00 02 00 02 00

The structure of both of these is:

- 4 bytes magic
- 2 byte frame header (for a compressed block of size 10)
- 3 byte block header (compressed size 10, regenerated size 7)
- 3 byte lit section header (for a lit block with compressed size 6 and regenerated size 7)
- 1 byte FSE weight desc header (size 4)
- 2 byte weight probs
- 2 byte weights
- 1 byte lits
- 1 byte sequences (0 sequences)

The second one is a more interesting test case because it will fail to decode if the truncated initial state bits, which are 1111, decode as anything other than 0, and will succeed if they decode as 0.
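For anyone who wants to try these locally, here is a small standalone harness I put together using only the public API (ZSTD_decompress / ZSTD_getErrorName; build with -lzstd); it feeds in the second block and prints either the decoded bytes or the error:

```c
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Second test block from above; expected output is 00 02 00 02 00 02 00
     * if the truncated initial state bits decode as 0. */
    static const unsigned char block[] = {
        0x28, 0xb5, 0x2f, 0xfd, 0x00, 0x00, 0x55, 0x00, 0x00, 0x72,
        0x80, 0x01, 0x04, 0x20, 0x7e, 0x1f, 0x02, 0xaa, 0x00
    };
    unsigned char out[16];
    size_t const r = ZSTD_decompress(out, sizeof(out), block, sizeof(block));

    if (ZSTD_isError(r)) {
        printf("decompression failed: %s\n", ZSTD_getErrorName(r));
        return 1;
    }
    printf("decompressed %zu bytes:", r);
    for (size_t i = 0; i < r; i++) printf(" %02x", out[i]);
    printf("\n");
    return 0;
}
```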

This doesn't appear to diverge with BMI2 availability, but it will fail to decompress if the preprocessor def in BIT_lookBits is changed.