Closed elasota closed 4 days ago
Indeed,
if there is only one Huffman weight,
it means there are only 2 symbols.
If there is only 1 weight to decode, that means there is no "zero" weights, so the symbols are necessarily 0
and 1
, and necessarily use 1 bit each.
This is a curious corner case, though I guess it could be made to happen by feeding zstd
with a random set of 0
and 1
values.
As you mentioned, it's likely encoded and decoded correctly, though hitting the fse
library in a way where it's not obvious how to deal with it.
The fse
library has a special mode when there is only one symbol with 100% probability, which is necessarily the case when there is only one symbol. It also has a mode to send data "raw" without compression, when there are not enough data, which is also likely the case for a single symbol to encode.
Finally, the header which describes the huffman tree is able to send weights directly, without employing fse
.
It's been a long time, but if I was to guess, I suspect it's this last solution which is employed.
The direct encoding will always be more efficient if there are only 2 values (it will consume 1 byte, whereas FSE-compressed weights will always consume at least 2 bytes) so I don't think the encoder should ever be producing blocks of this sort from any data.
It'd probably only happen with a specially constructed block.
But, this seems like it should be an error... and it only needs to be checked once per block at most so I assume the overhead would be small.
Here are some test blocks, the second one is probably the more useful one:
28 b5 2f fd 00 00 55 00 00 72 80 01 04 00 7e ff 03 aa 00
Decompresses to 01 02 01 02 01 02 01
28 b5 2f fd 00 00 55 00 00 72 80 01 04 20 7e 1f 02 aa 00
Decompresses to 00 02 00 02 00 02 00
The structure of both of these is: 4 bytes magic 2 byte frame header (for a compressed block of size 10) 3 byte block header (for compressed size 10 regenerated size 7) 3 byte lit section header (for a lit block with compressed size 6 and regenerated size 7) 1 byte FSE weight desc header (size 4) 2 byte weight probs 2 byte weights 1 byte lits 1 byte sequences (0 sequences)
The second one is a more interesting test case because it will fail to decode if the truncated initial state bits, which are 1111
, are decoded as anything other than 0, and will succeed if it decodes as 0.
This doesn't appear to diverge with BMI2 availability, but it will fail to decompress if the preprocessor def in BIT_lookBits
is changed.
Found another omission in the spec while trying to figure out if it was possible to encode only one Huffman weight in FSE mode, it appears that the answer is "no" but Zstd's behavior in such a situation is not well-defined.
FSE_decompress_usingDTable_generic
will load 2 initial state values from the input stream, but doesn't do any bitstream overflow check. With the current code paths, this results in the state being decoded as zero if it's fully or partially truncated, although the comments inBIT_lookBits
state that the results of bit overreads are undefined (and the code path that is currently disabled by a preprocessor defs inBIT_lookBits
will produce a different result).This doesn't necessarily trigger an error due to the Huffman tree being malformed either, if the initial states decode to 0 and 1 (in either order) then it will specify a valid Huffman tree.
I'll try to get a sample that reproduces this behavior.