InterDigitalInc / CompressAI

A PyTorch library and evaluation platform for end-to-end compression research
https://interdigitalinc.github.io/CompressAI/
BSD 3-Clause Clear License

About the y_likelihoods in Entropy models #314

Open chunbaobao opened 1 week ago

chunbaobao commented 1 week ago

Hi, this is excellent work. I have recently been working in the area of learned compression. I have already checked issue #306, but I am still confused about the actual meaning of y_likelihoods:

  1. The shape of y_likelihoods is equal to the shape of y, and I thought each element of y_likelihoods represents the probability $p(y_i)$, where $p(y_i)$ is the probability that the symbol $y_i$ appears. However, when I run the code below from the demo,

    # mean of the elements of y_likelihoods
    torch.sum(out_net['likelihoods']['y']).item() / torch.prod(torch.tensor(out_net['likelihoods']['y'].shape)).item()
    # result = 0.9271195729573568

    it produces an output where the elements of y_likelihoods are close to 1 everywhere. Isn't this contradictory? Shouldn't the sum of y_likelihoods be 1? What is the actual meaning of y_likelihoods?

  2. The calculation for bpp is $\sum_i -\log_2(p(y_i))/N$, but the information entropy is actually $\sum_i - p(y_i) \log_2(p(y_i))$. I am just wondering: where did the $p(y_i)$ go? Is it because $\sum_i -\log_2(p(y_i))/N$ is an upper bound on $\sum_i - p(y_i)\log_2(p(y_i))$, as in the inequality in the attached image?

YodaEmbedding commented 4 days ago

For each $i$, there is a discrete probability distribution $p_i$, i.e.,

$$\sum_{t \in \mathbb{Z}} p_i(t) = 1.$$

There are many such probability distributions -- one for each element in $\hat{y}$.
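
As a purely illustrative sketch of where such per-element distributions might come from (made-up means, scales, and support; not CompressAI's actual entropy-model code), consider a Gaussian per element, integrated over unit-width bins as in mean-scale hyperprior models:

```python
import torch
from torch.distributions import Normal

# Toy sketch: one discrete distribution p_i per latent element, obtained by
# integrating a per-element Gaussian over unit-width bins. The means and
# scales are made up for illustration.
means = torch.tensor([0.0, 1.5, -2.0])
scales = torch.tensor([1.0, 0.5, 2.0])

support = torch.arange(-20, 21).float()  # integer symbols t
dist = Normal(means[:, None], scales[:, None])

# p_i(t) = CDF_i(t + 0.5) - CDF_i(t - 0.5), one row per element i
pmf = dist.cdf(support + 0.5) - dist.cdf(support - 0.5)
print(pmf.sum(dim=1))  # each row sums to ~1 (up to truncation of the support)
```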

Each element $\hat{y}_i$ is encoded using its corresponding $p_i$. We can measure the "likelihood" $l_i$ for each element:

$$l_i = p_i(\hat{y}_i)$$

...and since the rate is the negative log-likelihood, the bit cost of the $i$-th element is:

$$R_i = -\log_2 l_i$$

The total rate cost is then:

$$R = \sum_i R_i$$

...which can be averaged over $N$ pixels to obtain the bpp.
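
Concretely, here is a minimal sketch of that computation, assuming the demo's `out_net` and the input image tensor `x` (shape `[1, 3, H, W]`) are in scope:

```python
import torch

# Minimal sketch (assumes `out_net` and the demo's input image `x` are in scope).
num_pixels = x.size(0) * x.size(2) * x.size(3)

y_likelihoods = out_net["likelihoods"]["y"]    # l_i = p_i(y_hat_i), same shape as y_hat
rate_bits = -torch.log2(y_likelihoods).sum()   # R = sum_i -log2(l_i)
bpp = rate_bits / num_pixels                   # bits per pixel
```

Note that the elements of `y_likelihoods` are not expected to sum to 1 across elements: each value is a single evaluation of a *different* distribution $p_i$ at $\hat{y}_i$, so an average close to 1 just means most elements are cheap to encode.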


For a single fixed encoding distribution $p$, the average rate cost for encoding a single symbol that is drawn from the same distribution $p$ is:

$$R = \sum_t - p(t) \, \log p(t)$$
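
As a small numerical illustration with a made-up 3-symbol distribution:

```python
import torch

# Entropy of a single fixed distribution p (made-up values).
p = torch.tensor([0.5, 0.25, 0.25])
H = -(p * torch.log2(p)).sum()
print(H)  # 1.5 bits per symbol, on average
```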

But this is not what we're doing. What we're actually interested in is the cross-entropy, i.e., the average rate cost for encoding a single symbol that is drawn from the true distribution $\hat{p}$ but encoded using $p$:

$$R = \sum_t - \hat{p}(t) \, \log p(t)$$

To be consistent with our notation above, we should also sprinkle in some $i$ subscripts:

$$R_i = \sum_t - \hat{p}_i(t) \, \log p_i(t)$$

In our case, we know exactly what $\hat{p}$ is...

$$\hat{p}_i(t) = \delta[t - \hat{y}_i] = \begin{cases} 1 & \text{if } t = \hat{y}_i \\ 0 & \text{otherwise} \end{cases}$$

If we plug this into the earlier equation, the rate cost for encoding the $i$-th element becomes:

$$R_i = -\log p_i(\hat{y}_i)$$
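
A quick numerical check of this collapse (toy values, not CompressAI code): with $\hat{p}_i$ a delta at $\hat{y}_i$, only one term of the cross-entropy sum survives.

```python
import torch

# Toy check: cross-entropy with a delta "true" distribution collapses to -log2 p(y_hat).
support = torch.arange(-3, 4)                      # assumed symbol alphabet
p = torch.softmax(-0.5 * support.float() ** 2, 0)  # some encoding PMF over the support
y_hat = 1                                          # observed (quantized) symbol

p_hat = (support == y_hat).float()                 # delta[t - y_hat]
cross_entropy = -(p_hat * torch.log2(p)).sum()     # sum_t -p_hat(t) * log2 p(t)

assert torch.isclose(cross_entropy, -torch.log2(p[support == y_hat].squeeze()))
```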