facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net
Other
23.48k stars 2.09k forks source link

Format documentation on Huffman Coding is unclear on how prefixes are chosen #4163

Open KillingSpark opened 1 day ago

KillingSpark commented 1 day ago

Describe the bug Format documentation on Huffman Coding is unclear on how prefixes are chosen

To Reproduce Read section https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#huffman-coding

Expected behavior Understand how prefixes are chosen based on weights

Additional context The section explains how the number of bits for a given symbol is calculated but it does not really explain how prefix codes are then assigned to the symbols in a given category of number of bits apart from this:

Symbols are sorted by Weight. Within same Weight, symbols keep natural sequential order. Symbols with a Weight of zero are removed. Then, starting from lowest Weight, prefix codes are distributed in sequential order.

Literal 0 1 2 3 4 5
Weight 4 3 2 0 1 1

Sorted by weight and then natural sequential order, it gives the following distribution :

Literal 3 4 5 2 1 0
Weight 0 1 1 2 3 4
Number_of_Bits 0 4 4 3 2 1
prefix codes N/A 0000 0001 001 01 1

I think it would help to be more explicit about how prefix codes of a given length are distributed. Maybe expanding the example slightly so that one rank contains more than 2 but does not fill the rank completely would also help.

For example like this:

Literal 0 1 2 3 4 5 6
Weight 5 4 3 0 1 1 1

Which would (if I understand this correctly?) result in this prefix distribution:

Literal 3 4 5 6 2 1 0
Weight 0 1 1 1 3 4 5
Number_of_Bits 0 5 5 5 3 2 1
prefix codes N/A 00001 00010 00011 001 01 1

Which I would phrase as such:

In each weight, the prefix codes are distributed so that the highest symbol is assigned the highest possible code in the range for that weight. Each following symbol gets assigned the next smaller code. To determine the possible range of codes for a weight compare the number of bits available for the weight in question with that of the next bigger weight. The lowest bits may be set to 1 while the upper rest must be left 0.

Cyan4973 commented 1 day ago

You are onto something.

Note that your second example is not correct, and doesn't produce a valid prefix code. You would need an additional symbol with Weight == 1 or equivalently Number_of_Bits == 5. Then it would nicely complete to a clean power of 2.

I can look at the documentation and see if the wording could be improved.

KillingSpark commented 1 day ago

Ah so this only works if every weight contains a clean power of two symbols? Meaning that in case the alphabet/probabilities I'm trying to use doesn't match that I'd need to fake some symbols and shift around weights to fix that?

Cyan4973 commented 1 day ago

Ah so this only works if every weight contains a clean power of two symbols?

I wouldn't phrase it this way.

For example, the following distribution of nbBits works : 2, 2, 2, 3, 3 which, in turn, becomes the following distribution of Weights: 2, 2, 2, 1, 1.

As one can see, 2 is present 3 times, which is not a power of 2. Yet it defines a valid prefix code.

What matters is that, at the end, the sum of all 2 ^ Weights is itself a clean power of 2.

One of the consequences is that, for this condition to hold, the number of Weight == 1 must necessarily be even.

KillingSpark commented 1 day ago

I think I get it, this is mentioned earlier in the document but a bit hidden in the description of how the decoder can deduce the last weight.

The decoder will do the inverse operation : having collected weights of literal symbols from 0 to 4, it knows the last literal, 5, is present with a non-zero Weight. The Weight of 5 can be determined by advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 0's) is : 8 + 4 + 2 + 0 + 1 = 15. Nearest larger power of 2 value is 16. Therefore, Max_Number_of_Bits = 4 and Weight[5] = log_2(16 - 15) + 1 = 1.

So weights 2, 2, 2, 1, 1 would result in:

Literal 3 4 0 1 2
weight 1 1 2 2 2
nbits 3 3 2 2 2
code 000 001 01 10 11

Maybe this condition could be stated more clearly and that it makes it possible to distribute the codes in this manner.

Cyan4973 commented 22 hours ago

What do you think of the new formulation presented in https://github.com/facebook/zstd/pull/4164/files#diff-19e4c045614d93835d868338566356ea12f2f6cc3e16f6e8771b82681de5dcb8R1403-R1407 ?

KillingSpark commented 21 hours ago

That's an improvement! I think it would be good to mention that the codes are ascending within one weight and then restart at each boundary

Cyan4973 commented 20 hours ago

Well, if you think about it, the ascending order still continues across weights. Just write the prefix codes using less bits with "additional fake bits" so that they all have the same length. The ascending order will become more apparent.

KillingSpark commented 17 hours ago

You mean like this? 0000, 0001, 001, 01, 1 --> 0000, 0001, 1001, 1101, 1111

I guess that makes it ascending but you still need to take special actions on the weight boundaries. For me it's more intuitive to think about the prefixes as the path through a tree and each weight defines a set of paths of the same length. A path of all zeros leads to the next weight, each other path leads to a symbol in that weight. So it makes sense to me that the path resets to 0..01 on each weight boundary.

Cyan4973 commented 17 hours ago

I meant like this : 0000, 0001, 001, 01, 1 0000, 0001, 001x, 01xx, 1xxx

KillingSpark commented 11 hours ago

Oh. Yeah that makes sense! Well maybe it's just me but I think this needs a few words to explain beyond just calling it ascending :)