Cyan4973 / FiniteStateEntropy

New generation entropy codecs : Finite State Entropy and Huff0
BSD 2-Clause "Simplified" License

How should we understand this function ‘FSE_normalizeM2’? #108

huhu0823 opened this issue 2 years ago

huhu0823 commented 2 years ago

I am new here and still reading the source code. I can't understand these calculations:

Lines 415-429 of https://github.com/Cyan4973/FiniteStateEntropy/blob/dev/lib/fse_compress.c

Cyan4973 commented 2 years ago

This is code I haven't looked at in a while, so I may be forgetting details.

If I remember correctly, the function FSE_normalizeM2() is a backup strategy to distribute statistics across an alphabet. The normalization requires that the sum of all statistics equal a clean power-of-2 total.
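To make the requirement concrete, here is a minimal sketch of the invariant every normalization strategy must satisfy; this is not library code, and the `_example` names are hypothetical. In FSE's normalized tables, a count of -1 marks a "less than 1" low-probability symbol that still costs one slot, which the checker accounts for:

```c
/* Sketch only: verifies that normalized counts sum to 2^tableLog.
 * Names ending in _example are not part of the library. */
#include <stdio.h>

static int sums_to_power_of_two_example(const short* normCount,
                                        unsigned maxSymbolValue,
                                        unsigned tableLog)
{
    long total = 0;
    for (unsigned s = 0; s <= maxSymbolValue; s++) {
        short c = normCount[s];
        total += (c < 0) ? -c : c;   /* -1 ("less than 1") still costs 1 slot */
    }
    return total == (1L << tableLog);
}

int main(void)
{
    /* 4 symbols normalized to a total of 2^4 = 16 */
    short normCount[4] = { 9, 4, 2, 1 };
    printf("%d\n", sums_to_power_of_two_example(normCount, 3, 4)); /* prints 1 */
    return 0;
}
```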

Sometimes this is relatively straightforward, but if the code ever reaches this function, it means the initial distribution strategy failed and the statistics were actually difficult to normalize.

It could be for a number of reasons, a common one being that there are many symbols with a very low yet non-zero probability. Since each symbol must receive a frequency of at least 1, they collectively occupy more of the probability space than they should. This creates an imbalance, and all the other symbols have to make up for it.
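A small worked example of that imbalance (a sketch with made-up numbers, not library code): with 200 rare symbols and a 2^8 = 256 total, the minimum-of-1 rule alone consumes 200 of the 256 slots.

```c
/* Hypothetical numbers illustrating the imbalance described above. */
#include <stdio.h>

int main(void)
{
    const unsigned tableLog  = 8;
    const unsigned tableSize = 1u << tableLog;  /* 256 slots to distribute */
    const unsigned srcSize   = 10000;           /* total input symbols */
    const unsigned rareCount = 3;               /* a rare symbol seen 3 times */
    const unsigned nbRare    = 200;             /* ...and there are 200 such symbols */

    /* Fair share of one rare symbol: 3/10000 * 256, well below 1 slot,
     * yet the normalization must grant it at least 1 full slot. */
    double fairShare = (double)rareCount * tableSize / srcSize;
    printf("fair share %.3f slots, granted 1\n", fairShare);

    /* Collectively the rare symbols take 200 of the 256 slots,
     * leaving only 56 for the frequent ones to share. */
    printf("slots left for frequent symbols: %u\n", tableSize - nbRare);
    return 0;
}
```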

Now this can be done in multiple ways, with varying degrees of complexity and efficiency. The code you refer to appears to be the last possible strategy, used after all previous ones have been ruled out. I believe it spreads statistics evenly across the remaining symbols, scaling them into the probability space still required to reach the total, as sketched below.
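Here is a hedged sketch of that idea, under stated assumptions and not the exact code of FSE_normalizeM2: the slots still to hand out are distributed to the not-yet-assigned symbols in proportion to their raw counts, using a cumulative share so rounding errors never accumulate and the running total lands exactly on the target. The `_sketch` names and the `UNASSIGNED` marker are local to this example.

```c
/* Sketch of an even-spread fallback, not the library's implementation. */
#include <stdio.h>

#define UNASSIGNED (-2)  /* sketch-local "not yet decided" marker */

static void spread_remaining_sketch(short* norm, const unsigned* count,
                                    unsigned maxSymbolValue,
                                    unsigned slotsToDistribute,  /* slots still to hand out */
                                    unsigned countOfUnassigned)  /* raw counts not yet normalized */
{
    unsigned long long seen = 0;
    unsigned handedOut = 0;
    for (unsigned s = 0; s <= maxSymbolValue; s++) {
        if (norm[s] == UNASSIGNED) {
            seen += count[s];
            /* cumulative proportional target: guarantees an exact final total */
            unsigned target = (unsigned)(seen * slotsToDistribute / countOfUnassigned);
            norm[s] = (short)(target - handedOut);
            handedOut = target;
        }
    }
    /* by construction, handedOut == slotsToDistribute here */
}

int main(void)
{
    /* two frequent symbols share the 56 slots left over in the earlier example */
    unsigned count[2] = { 7000, 2400 };
    short norm[2] = { UNASSIGNED, UNASSIGNED };
    spread_remaining_sketch(norm, count, 1, 56, 9400);
    printf("%d + %d = %d\n", norm[0], norm[1], norm[0] + norm[1]); /* 41 + 15 = 56 */
    return 0;
}
```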

The particular method used here is not important for understanding FSE. Any other entropy codec, for example an arithmetic coder, faces the same challenge: it needs to "normalize" statistics onto a clean power-of-2 scale. This could be achieved in many different ways; it doesn't change the rest of the encoding process.

huhu0823 commented 2 years ago

Thank you for your reply! Although I still don't quite understand this, I think it's really not that important. Maybe in the future I will come to understand it. Thank you again!