InterDigitalInc / CompressAI

A PyTorch library and evaluation platform for end-to-end compression research
https://interdigitalinc.github.io/CompressAI/
BSD 3-Clause Clear License

sensetime and google #293

Closed · formioq closed this issue 3 months ago

formioq commented 3 months ago

Sorry to bother you. I noticed that when using the cheng2020-anchor-checkerboard model and subsequent models in sensetime.py, the components of the model seem to differ from those in google.py. Is this because the likelihood calculation differs from the models in google.py? Could you please explain why the components are not the same?

formioq commented 3 months ago

Actually, what I want to ask more about is the calculation of likelihood in the context model. I noticed that in mbt2018, the context model is only used to provide parameters to the EP, but in cheng2020-anchor-checkerboard, the CheckerboardLatentCodec makes it a bit harder for me to understand. Sorry if my explanation is a bit confusing.

YodaEmbedding commented 3 months ago

The checkerboard context model computes the mean/scale entropy params in two steps. Then, the likelihood computation is the same as usual.

class CheckerboardLatentCodec(LatentCodec):
    def _forward_twopass(self, y: Tensor, side_params: Tensor) -> Dict[str, Any]:
        ...
        # `params` is built in two passes: pass 1 predicts entropy params for
        # the "anchor" pixels (with zeroed spatial context); pass 2 predicts
        # them for the "non-anchor" pixels (using the checkerboard conv over
        # the already-decoded anchors).
        y_out = self.latent_codec["y"](y, params)  # This calls GCLC.forward
        return {
            "likelihoods": {
                "y": y_out["likelihoods"]["y"],
            },
            "y_hat": y_hat,  # assembled from the two passes (elided above)
        }

class GaussianConditionalLatentCodec(LatentCodec):
    def forward(self, y: Tensor, ctx_params: Tensor) -> Dict[str, Any]:
        # (In the actual code, ctx_params first passes through
        # self.entropy_parameters, which defaults to the identity.)
        gaussian_params = ctx_params
        scales_hat, means_hat = self._chunk(gaussian_params)
        # The usual likelihood computation, same as in e.g. mbt2018:
        y_hat, y_likelihoods = self.gaussian_conditional(y, scales_hat, means=means_hat)
        if self.quantizer == "ste":
            y_hat = quantize_ste(y - means_hat) + means_hat
        return {"likelihoods": {"y": y_likelihoods}, "y_hat": y_hat}

Let me know if you have additional questions.


The main reason for these components is composability/reusability, though I suppose I may have gone a bit overboard when implementing them. I'll look into simplifying/commenting them a bit more in the future.

formioq commented 3 months ago

Thank you very much for your explanation. I might have some additional questions. Specifically, if I don't adopt the two-step coding architecture described in the paper, but instead directly replace the context model's MaskedConv2d with CheckerboardMaskedConv2d in an architecture like JointAutoregressiveHierarchicalPriors (mbt2018), it seems all components can still function normally (since the input and output data shapes remain the same). I would like to ask whether such a modification could potentially cause any issues in likelihood calculation, PSNR calculation, or any other aspects?

formioq commented 3 months ago

This might be somewhat similar to an issue I raised previously... However, I must admit I didn't fully understand the previous explanation either. :(

YodaEmbedding commented 3 months ago

The models contain code intended for different decoding orders. Things won't work correctly if only the masked conv is replaced. It is necessary to also replace the entire context model code used to compress/decompress the latent (i.e. "latent codec"). If that code isn't also replaced, the entropy parameters (means/scales) will be incorrectly estimated.
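If you do want a checkerboard context in an mbt2018-style model, the supported route is to swap in the corresponding latent codec as a whole. A rough sketch of what that wiring looks like, based on the Cheng2020AnchorCheckerboard definition in sensetime.py (the layer sizes here are assumptions; check the source for the exact values):

import torch.nn as nn

from compressai.latent_codecs import (
    CheckerboardLatentCodec,
    GaussianConditionalLatentCodec,
)
from compressai.layers import CheckerboardMaskedConv2d

N = 192  # number of latent channels (assumed)

latent_codec = CheckerboardLatentCodec(
    latent_codec={"y": GaussianConditionalLatentCodec(quantizer="ste")},
    # EP takes the merged (y_ctx, side_params) tensor and outputs means/scales:
    entropy_parameters=nn.Sequential(
        nn.Conv2d(N * 4, N * 3, 1),
        nn.LeakyReLU(inplace=True),
        nn.Conv2d(N * 3, N * 2, 1),
    ),
    # The checkerboard-masked conv that produces y_ctx in the second pass:
    context_prediction=CheckerboardMaskedConv2d(
        N, 2 * N, kernel_size=5, stride=1, padding=2
    ),
)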

Likelihood is computed from the entropy parameters (means/scales). If those parameters are calculated correctly, everything else should work without further modifications.¹
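For intuition: given the means/scales, the rate is fully determined. A minimal sketch using GaussianConditional directly (shapes and parameter values here are arbitrary):

import torch
from compressai.entropy_models import GaussianConditional

gc = GaussianConditional(None)  # a scale_table is only needed for compress()
y = torch.randn(1, 192, 16, 16)
means, scales = torch.zeros_like(y), torch.ones_like(y)  # hypothetical entropy params
y_hat, y_likelihoods = gc(y, scales, means=means)
rate_bits = -y_likelihoods.log2().sum()  # estimated rate for the whole tensor

If only the masked conv is swapped, the decompress-time decoding order no longer matches the context the conv was trained on, so the means/scales (and hence the decoded latent) come out wrong.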

This overview may be helpful.


¹ With some slight inefficiency due to quantization/causality.

formioq commented 3 months ago

Thank you again for your answer, it has been very beneficial to me. I hope you don't mind if I ask some further questions based on your previous response.

Firstly, regarding the statement "Likelihood is computed from the entropy parameters (means/scales)," the y likelihood is calculated by the gaussian_conditional function which takes means/scales as input. These means/scales are derived from the entropy_parameters function. So, does the phrase "the entropy parameters (means/scales) will be incorrectly estimated" mean that the entropy_parameters function will not work correctly (i.e., the input to entropy_parameters is problematic)?

In mbt2018, the EP (entropy parameters network) accepts "torch.cat((params, ctx_params), dim=1)" as input, where params comes from the h_s decoding and ctx_params comes from the MaskedConv2d convolution. In Cheng2020AnchorCheckerboard, the EP seems to be used twice:

The first time, it accepts "self.merge(y_ctx, side_params)" as input to get params_i (where y_ctx is all zeros in the first pass and comes from CheckerboardMaskedConv2d in the second pass, and side_params comes from the hyperprior's h_s decoding?).

The second time, it is directed to GCLC by "func = getattr(self.latent_codec["y"], "entropy_parameters", lambda x: x)". Is entropy_parameters actually used here? In the Cheng2020AnchorCheckerboard instantiation, entropy_parameters seems to be provided only at the CheckerboardLatentCodec layer, and the instantiation "y": GaussianConditionalLatentCodec(quantizer="ste") does not provide entropy_parameters, so it defaults to the identity lambda x: x?

Back to the initial question: when entropy_parameters does not work correctly, the issue lies in the input y_ctx being incorrect. It should come from two-step decoding, so directly replacing context_prediction leads to problems.

Is there any logical error in my understanding here?

Secondly, the scope of the context model definition: Where does the entire context model start and end? For example, in mbt2018, can ctx_hat=self.context_prediction(y_hat) be considered the entire context model in this architecture (i.e., the process from input y_hat to output ctx_hat)? Then how should I understand the scope of the entire context model in Cheng2020AnchorCheckerboard? Does it include EP, and even the whole CheckerboardLatentCodec, including GCLC?

Thirdly, drawing the model structure: I find that drawing the architecture of each model in google.py is a good way to understand the whole running process, as is tracing a complete forward method to quickly follow the data flow through the model. I wonder if you could draw a simple schematic diagram of the Cheng2020AnchorCheckerboard and even Elic2022Official model architectures? (I find that the model structure diagrams in papers are not as clear and straightforward as those in google.py).

I apologize for bothering you again. Your patient answers have helped an undergraduate student a lot. Looking forward to your reply!

YodaEmbedding commented 3 months ago

So, does the phrase "the entropy parameters (means/scales) will be incorrectly estimated" mean that the entropy_parameters function will not work correctly (i.e., the input to entropy_parameters is problematic)?

The purpose of the entropy parameters $\mu_{i,j,k}$ and $\sigma_{i,j,k}$ is to model a Gaussian distribution $\mathcal{N}(\mu_{i,j,k}, \sigma_{i,j,k}^2)$ for a tensor element $y_{i,j,k}$. The closer $\mu$ is to $y$ and the smaller $\sigma$ is, the fewer bits are used for encoding that element. Ideally, if $\mu = y$ and $\sigma = 0$, then the rate cost for that element will be 0 bits, since it can be exactly and confidently predicted by the decoder.
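To make the rate relationship concrete, here is a small sketch (plain PyTorch, not the CompressAI API) of the bits needed to encode a single element under a unit-width quantization bin:

import torch
from torch.distributions import Normal

def element_bits(y, mu, sigma):
    # Bits to encode y under N(mu, sigma^2) with unit-width quantization bins:
    # -log2( CDF(y + 0.5) - CDF(y - 0.5) )
    d = Normal(mu, sigma)
    likelihood = d.cdf(y + 0.5) - d.cdf(y - 0.5)
    return -torch.log2(likelihood)

y = torch.tensor(3.0)
print(element_bits(y, torch.tensor(3.0), torch.tensor(0.1)))  # ~0 bits (good prediction)
print(element_bits(y, torch.tensor(0.0), torch.tensor(1.0)))  # ~7.4 bits (poor prediction)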

The context model uses a nearby context of previously decoded pixels to help estimate the entropy parameters for the next pixel.

Let's say you were to train using a masked 3x3 checkerboard context, and then use the code for the raster-scan context at runtime. In that case, the elements $\{ y_{:,j+1,k},\ y_{:,j,k+1} \}$ would not be available while decoding. Thus, the context model would only have access to half the elements it needs. But the model expects these elements to be there to help it predict a good set of entropy parameters for the next pixel. This is why the masked context should match the context that is available at runtime. (i.e., y_ctx needs to be the same during training and at runtime.)
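For reference, the two 3x3 masks look like this (a small illustration, not the library code; CompressAI's MaskedConv2d / CheckerboardMaskedConv2d build analogous masks internally):

import torch

# Raster-scan ("A"-type) mask: only neighbours above and to the left are visible.
raster_mask = torch.tensor([
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
])

# Checkerboard mask: only the 4-connected neighbours (opposite parity) are visible.
checkerboard_mask = torch.tensor([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

A conv trained with one mask relies on exactly those neighbours being decoded; decoding in a different order leaves some of them undefined.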

The second time, it is directed to GCLC by "func = getattr(self.latent_codec["y"], "entropy_parameters", lambda x: x)". Is entropy_parameters actually used here? In the Cheng2020AnchorCheckerboard instantiation, entropy_parameters seems to be provided only at the CheckerboardLatentCodec layer, and the instantiation "y": GaussianConditionalLatentCodec(quantizer="ste") does not provide entropy_parameters, so it defaults to the identity lambda x: x?

That particular entropy_parameters is the identity, as you correctly identified. I'll probably remove that line of code since it commonly causes confusion.

Where does the entire context model start and end? For example, in mbt2018, can ctx_hat=self.context_prediction(y_hat) be considered the entire context model in this architecture (i.e., the process from input y_hat to output ctx_hat)? Then how should I understand the scope of the entire context model in Cheng2020AnchorCheckerboard? Does it include EP, and even the whole CheckerboardLatentCodec, including GCLC?

People seem to use the terminology differently. I've seen people refer to self.context_prediction alone as the context model. On the other hand, MLIC seems to treat it as the whole process used for decoding $y$ in multiple steps, where elements of $y$ are decoded with the help of the available context (previously decoded elements of $y$). I slightly prefer this second interpretation.

I wonder if you could draw a simple schematic diagram of the Cheng2020AnchorCheckerboard and even Elic2022Official model architectures? (I find that the model structure diagrams in papers are not as clear and straightforward as those in google.py).

The overall architecture components should be fairly similar to JointAutoregressiveHierarchicalPriors.

Generic "autoregressive" architecture

The internals are mostly the same, too, except for the use of a different context and decoding order / number of decoding steps:

[Figure: Checkerboard context model internals]
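Concretely, the checkerboard decode splits the spatial positions by parity into two passes; a tiny illustration (the anchor/non-anchor naming here is assumed and may not match the code's convention):

import torch

H, W = 4, 4
parity = (torch.arange(H)[:, None] + torch.arange(W)[None, :]) % 2
anchor = parity == 0      # pass 1: entropy params from the hyperprior alone
non_anchor = parity == 1  # pass 2: params also use the checkerboard context of anchors
print(anchor.int())       # checkerboard pattern of 1s and 0s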

ELIC is just a "conditional channel group" context model, where each channel group is predicted using (a) previously decoded channel groups, (b) the checkerboard spatial context, and (c) the "hyperprior" context $h_s(\hat{z})$.

[Figure: Conditional channel groups]
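A rough, runnable sketch of that conditioning structure (all names, group counts, and layer sizes here are illustrative, not the Elic2022Official internals; the spatial checkerboard step within each group is elided):

import torch
import torch.nn as nn

num_groups, C, side_C = 4, 192, 384
group_C = C // num_groups

# One entropy-parameters net per group; group i additionally sees the
# i * group_C channels of already-decoded groups. (Hypothetical sizes.)
ep = nn.ModuleList(
    nn.Conv2d(side_C + i * group_C, 2 * group_C, 1) for i in range(num_groups)
)

y = torch.randn(1, C, 16, 16)
side_params = torch.randn(1, side_C, 16, 16)  # hyperprior context h_s(z_hat)

y_hat_groups = []
for i, y_i in enumerate(y.chunk(num_groups, dim=1)):
    # Condition on (a) previously decoded groups and (c) the hyperprior context:
    ctx = torch.cat([side_params, *y_hat_groups], dim=1)
    means_i, scales_i = ep[i](ctx).chunk(2, dim=1)
    # (scales_i would parameterize the entropy coder; unused in this sketch.)
    y_hat_i = torch.round(y_i - means_i) + means_i  # quantize around the mean
    y_hat_groups.append(y_hat_i)

y_hat = torch.cat(y_hat_groups, dim=1)
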
formioq commented 3 months ago

Thank you very much for your response. I don't have any more questions for now, but if I come up with new ones later on, I will open another issue. :)