Open KillingSpark opened 1 day ago
You are onto something.
Note that your second example is not correct, and doesn't produce a valid prefix code.
You would need an additional symbol with Weight == 1
or equivalently Number_of_Bits == 5
.
Then it would nicely complete to a clean power of 2.
I can look at the documentation and see if the wording could be improved.
Ah so this only works if every weight contains a clean power of two symbols? Meaning that in case the alphabet/probabilities I'm trying to use doesn't match that I'd need to fake some symbols and shift around weights to fix that?
Ah so this only works if every weight contains a clean power of two symbols?
I wouldn't phrase it this way.
For example, the following distribution of nbBits
works :
2, 2, 2, 3, 3
which, in turn, becomes the following distribution of Weights
:
2, 2, 2, 1, 1
.
As one can see, 2
is present 3 times, which is not a power of 2. Yet it defines a valid prefix code.
What matters is that, at the end, the sum of all 2 ^ Weights
is itself a clean power of 2.
One of the consequences is that, for this condition to hold, the number of Weight == 1
must necessarily be even.
I think I get it, this is mentioned earlier in the document but a bit hidden in the description of how the decoder can deduce the last weight.
The decoder will do the inverse operation : having collected weights of literal symbols from 0 to 4, it knows the last literal, 5, is present with a non-zero Weight. The Weight of 5 can be determined by advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 0's) is : 8 + 4 + 2 + 0 + 1 = 15. Nearest larger power of 2 value is 16. Therefore, Max_Number_of_Bits = 4 and Weight[5] = log_2(16 - 15) + 1 = 1.
So weights 2, 2, 2, 1, 1
would result in:
Literal | 3 | 4 | 0 | 1 | 2 |
---|---|---|---|---|---|
weight | 1 | 1 | 2 | 2 | 2 |
nbits | 3 | 3 | 2 | 2 | 2 |
code | 000 | 001 | 01 | 10 | 11 |
Maybe this condition could be stated more clearly and that it makes it possible to distribute the codes in this manner.
What do you think of the new formulation presented in https://github.com/facebook/zstd/pull/4164/files#diff-19e4c045614d93835d868338566356ea12f2f6cc3e16f6e8771b82681de5dcb8R1403-R1407 ?
That's an improvement! I think it would be good to mention that the codes are ascending within one weight and then restart at each boundary
Well, if you think about it, the ascending order still continues across weights. Just write the prefix codes using less bits with "additional fake bits" so that they all have the same length. The ascending order will become more apparent.
You mean like this?
0000, 0001, 001, 01, 1
-->
0000, 0001, 1001, 1101, 1111
I guess that makes it ascending but you still need to take special actions on the weight boundaries. For me it's more intuitive to think about the prefixes as the path through a tree and each weight defines a set of paths of the same length. A path of all zeros leads to the next weight, each other path leads to a symbol in that weight. So it makes sense to me that the path resets to 0..01
on each weight boundary.
I meant like this :
0000, 0001, 001, 01, 1
0000, 0001, 001x, 01xx, 1xxx
Oh. Yeah that makes sense! Well maybe it's just me but I think this needs a few words to explain beyond just calling it ascending :)
Describe the bug Format documentation on Huffman Coding is unclear on how prefixes are chosen
To Reproduce Read section https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#huffman-coding
Expected behavior Understand how prefixes are chosen based on weights
Additional context The section explains how the number of bits for a given symbol is calculated but it does not really explain how prefix codes are then assigned to the symbols in a given category of number of bits apart from this:
I think it would help to be more explicit about how prefix codes of a given length are distributed. Maybe expanding the example slightly so that one rank contains more than 2 but does not fill the rank completely would also help.
For example like this:
Weight
Which would (if I understand this correctly?) result in this prefix distribution:
Weight
Number_of_Bits
Which I would phrase as such:
In each weight, the prefix codes are distributed so that the highest symbol is assigned the highest possible code in the range for that weight. Each following symbol gets assigned the next smaller code. To determine the possible range of codes for a weight compare the number of bits available for the weight in question with that of the next bigger weight. The lowest bits may be set to 1 while the upper rest must be left 0.