For each $i$, there is a discrete probability distribution $p_i$, i.e.,
$$\sum_{t} p_i(t) = 1.$$
There are many such probability distributions -- one for each element in $\hat{y}$.
Each element $\hat{y}_i$ is encoded using its corresponding $p_i$. We can measure the "likelihood" $l_i$ for each element:
$$l_i = p_i(\hat{y}_i)$$
...and since the rate is the negative log-likelihood, the bit cost of the $i$-th element is:
$$R_i = -\log_2 l_i$$
The total rate cost is then:
$$R = \sum_i R_i$$
...which can be averaged over $N$ pixels to obtain the bpp.
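For concreteness, here is a minimal sketch of how that sum turns into bpp. The tensor name `y_likelihoods` and the shapes are made-up assumptions for illustration, not values from any particular model:

```python
import torch

# Hypothetical shapes: 1 image, 192 latent channels, a 16x16 grid of
# latent elements, and a 256x256 input image.
y_likelihoods = torch.rand(1, 192, 16, 16).clamp_min(1e-9)

# Per-element rate in bits: R_i = -log2(l_i)
bits = -torch.log2(y_likelihoods)

# Total rate, averaged over the pixels of the *input image* (not over
# the number of latent elements) to get bits per pixel.
num_pixels = 256 * 256
bpp = bits.sum() / num_pixels
print(f"{bpp.item():.4f} bpp")
```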
For a single fixed encoding distribution $p$, the average rate cost for encoding a single symbol that is drawn from the same distribution $p$ is:
$$R = \sum_t - p(t) \, \log p(t)$$
But this is not quite what we're doing. What we're actually interested in is the cross-entropy: the average rate cost for encoding a single symbol drawn from the true distribution $\hat{p}$ while still using the encoding distribution $p$:
$$R = \sum_t - \hat{p}(t) \, \log p(t)$$
To be consistent with our notation above, we should also sprinkle in some $i$'s:
$$R_i = \sum_t - \hat{p}_i(t) \, \log p_i(t)$$
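As a quick illustration (toy numbers, not from any model), here is entropy vs. cross-entropy for a 4-symbol alphabet:

```python
import torch

# Toy 4-symbol alphabet: p_hat is the true distribution of the data,
# p is the (mismatched) distribution the entropy coder uses.
p_hat = torch.tensor([0.7, 0.1, 0.1, 0.1])
p     = torch.tensor([0.4, 0.3, 0.2, 0.1])

entropy       = -(p_hat * torch.log2(p_hat)).sum()  # best achievable average rate
cross_entropy = -(p_hat * torch.log2(p)).sum()      # average rate actually paid

print(entropy.item(), cross_entropy.item())  # cross_entropy >= entropy
```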
In our case, we know exactly what $\hat{p}$ is...
$$\hat{p}_i(t) = \delta[t - \hat{y}_i] = \begin{cases} 1 & \text{if } t = \hat{y}_i \\ 0 & \text{otherwise} \end{cases}$$
If we plug this into the earlier equation, the rate cost for encoding the $i$-th element becomes:
$$R_i = -\log p_i(\hat{y}_i)$$
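A quick sanity check (again with toy numbers) that the one-hot $\hat{p}_i$ collapses the cross-entropy sum to a single term:

```python
import torch

p_i   = torch.tensor([0.4, 0.3, 0.2, 0.1])   # encoding distribution for element i
y_hat = 2                                     # the symbol we actually observed
p_hat = torch.zeros(4)
p_hat[y_hat] = 1.0                            # delta / one-hot "true" distribution

full_sum = -(p_hat * torch.log2(p_i)).sum()   # cross-entropy against the delta
shortcut = -torch.log2(p_i[y_hat])            # -log2 p_i(y_hat)
print(full_sum.item(), shortcut.item())       # both ~2.32 bits
```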
Thank you for your reply; it has been incredibly helpful! However, I still have several stupid questions:
In the `update()` function, we need to add `pmf_start` to `sample`, and the symbols need to subtract `offset` in the C++ interface, so that we get $p(y_i - \text{offset} = t + \text{pmf\_start})$, where $t \in \{0, 1, 2, \ldots, N\}$. I'm just wondering why we can't simply set `sample = torch.arange(max_length)` and omit the `pmf_start` and `offset` steps to get $p(y_i = t)$, where $t \in \{0, 1, 2, \ldots, N\}$? What's the meaning of `pmf_start` and `offset` here? Is it because the min of the latent `y` is not zero?

Also, why isn't the tail mass split across both ends, e.g.

```python
prob = torch.cat((tail_mass[i]/2, p[: pmf_length[i]], tail_mass[i]/2), dim=0)
```
Values of `y` are expected to occur between `quantiles[0]` and `quantiles[2]` with a total probability of `1 - tail_mass`. That is, P(quantiles[0] ≤ y ≤ quantiles[2]) = 1 - tail_mass.

When converting these to a discrete distribution over the range `[0, pmf_length]`, we:

1. Round `quantiles[0]` and `quantiles[2]` outwards (i.e. floor and ceil). This gives us the number of symbols (`pmf_length`) and the value that the 0th symbol represents (`pmf_start`).
2. Compute the `pmf`. We need to evaluate it at all the samples/points it represents: `samples = torch.arange(max_length) + pmf_start[:, None]`. (The code also includes an empty dimension in the middle for compatibility with `_likelihood`.)
3. Compute the `tail_mass` (which should equal 1 minus the sum of our new `pmf`).
4. Compute the `quantized_cdf` that we will feed into the C++ entropy coder. (A rough sketch of steps 1-3 is given below.)
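If it helps, here is a rough, self-contained sketch of steps 1-3. The function name `build_pmf`, the `(C, 3)` shape of `quantiles`, and the standard Gaussian stand-in for the learned cumulative are all assumptions for illustration; this mirrors the idea, not the exact CompressAI code:

```python
import torch

def build_pmf(quantiles):
    """Sketch of the discretization above; `quantiles` is (C, 3): left tail, median, right tail."""
    left, right = quantiles[:, 0], quantiles[:, 2]

    # 1. Round the quantiles outwards to get pmf_start and pmf_length.
    pmf_start = torch.floor(left).int()
    pmf_length = torch.ceil(right).int() - pmf_start + 1
    max_length = int(pmf_length.max())

    # 2. Evaluate each channel's pmf at every integer sample it covers.
    samples = torch.arange(max_length) + pmf_start[:, None]      # (C, max_length)
    cdf = lambda x: 0.5 * (1 + torch.erf(x / 2 ** 0.5))          # stand-in for the learned CDF
    pmf = cdf(samples + 0.5) - cdf(samples - 0.5)                # P(y == sample) per channel

    # Mask out samples that fall beyond each channel's own pmf_length.
    valid = torch.arange(max_length)[None, :] < pmf_length[:, None]
    pmf = pmf * valid

    # 3. The remaining tail mass is (roughly) 1 minus the sum of the pmf.
    tail_mass = 1.0 - pmf.sum(dim=1, keepdim=True)

    return pmf, tail_mass, pmf_start, pmf_length
```

Step 4 then appends the tail mass as the extra out-of-range symbol and quantizes the cumulative sum, as described in the P.S. below.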
P.S. In the `EntropyBottleneck`, we use a single distribution per channel, as visualized here in Figure 4a.

Out-of-range values (i.e. values on the left of `quantiles[0]` or on the right of `quantiles[2]`) are encoded using the "bypass mode". This is done by encoding a max value symbol, then encoding the out-of-range value with the bypass method. https://github.com/InterDigitalInc/CompressAI/blob/743680befc146a6d8ee7840285584f2ce00c3732/compressai/cpp_exts/rans/rans_interface.cpp#L298-L320

Since max value symbols (i.e. out-of-range values to the left of `quantiles[0]` or to the right of `quantiles[2]`) are expected to occur with probability `tail_mass`, we use:

```python
prob = torch.cat((p[: pmf_length[i]], tail_mass[i]), dim=0)
```

Notice that the extra out-of-range symbol actually makes the quantized pmf we use of size `pmf_length + 1`, and its quantized cdf of size `pmf_length + 2`.
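And here is a rough sketch of how that `pmf_length + 1`-sized pmf becomes a `pmf_length + 2`-sized integer CDF for the range coder. The name `quantize_cdf_sketch` and the `precision` default are assumptions; this is not the library's actual C++ routine, just an illustration of the bookkeeping:

```python
import torch

def quantize_cdf_sketch(pmf, tail_mass, precision=16):
    # Append the single out-of-range symbol -> pmf of size pmf_length + 1.
    prob = torch.cat((pmf, tail_mass.reshape(1)), dim=0)

    # Prepend 0 and accumulate -> CDF of size pmf_length + 2, scaled to
    # integers in [0, 2**precision].
    cdf = torch.cumsum(prob / prob.sum(), dim=0)
    cdf = torch.cat((torch.zeros(1), cdf), dim=0)
    cdf = (cdf * (1 << precision)).round().int()

    # A real implementation must also guarantee that every symbol keeps a
    # non-zero width after rounding, otherwise it cannot be encoded.
    return cdf
```

For example, `quantize_cdf_sketch(torch.tensor([0.2, 0.5, 0.28]), torch.tensor(0.02))` returns a 5-entry CDF (3 in-range symbols + 1 out-of-range symbol + the leading zero) ending at 65536.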
Hi, it's excellent work. I have recently been working in the area of learned compression. I have already checked issue #306, but I am still confused about the actual meaning of `y_likelihoods`:

The shape of `y_likelihoods` is equal to the shape of `y`, and I thought each element of `y_likelihoods` represents the probability $p(y_i)$, i.e. the probability that the symbol $y_i$ appears. However, when I run the code below from the demo, it produces an output where the elements of `y_likelihoods` are close to 1 everywhere. Isn't this contradictory? Shouldn't the sum of `y_likelihoods` be 1? What is the actual meaning of `y_likelihoods`?

The calculation for bpp is $\sum_i -\log_2(p(y_i))/N$, but the information entropy is actually $\sum_i -p(y_i)\log_2(p(y_i))$. I'm just wondering where the $p(y_i)$ went. Is it because $\sum_i -\log_2(p(y_i))/N$ is an upper bound on $\sum_i -p(y_i)\log_2(p(y_i))$ from the inequality below?