LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
MIT License

Comparison with self-conditioning proposed in Analog Bits, and basic two pass sampling baselines #2

Open YouJiacheng opened 11 months ago

YouJiacheng commented 11 months ago

Dear authors: Thank you for open-sourcing your great work RCG.

However, I have noticed that:

  1. A closely related technique called self-conditioning (NO clustering is performed, in contrast to [3, 34, 40]) was proposed in Analog Bits, https://arxiv.org/abs/2208.04202. This technique leverages the x0 prediction of the previous denoising step as a condition and greatly improves the performance of Analog Bits. Recent works have shown that this technique is also effective for continuous generation. (Note: Analog Bits uses a continuous state space and discretizes the final sample.) This technique is compatible with parallel decoding methods as well: parallel decoding methods predict all masked tokens in each step while only accepting those with top confidence, but all predicted tokens can be used as the condition for the next prediction step. (A minimal sketch of self-conditioned sampling follows this list.)

  2. All baseline methods use only one sampling pass, while RCG uses two sampling passes. This may cause an unfair comparison. It is well known that diffusion models can achieve higher generation quality by using two sampling passes (first denoise Gaussian noise into an intermediate result, then add noise of a proper scale to it, and finally denoise it again), even without any specific training. Thus, a naive baseline can be constructed as described above (see the two-pass sketch below).
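
To make item 1 concrete, here is a minimal sketch of self-conditioned sampling. `model(x_t, t, x0_prev)` is a hypothetical denoiser that predicts x0, and `posterior_step` is a hypothetical helper that draws x_{t-1}; neither is from this repo or from the Analog Bits code.

```python
import torch

@torch.no_grad()
def sample_with_self_conditioning(model, posterior_step, shape, num_steps):
    x_t = torch.randn(shape)              # start from pure Gaussian noise
    x0_prev = torch.zeros(shape)          # zero self-condition at the first step
    for t in reversed(range(num_steps)):
        x0_pred = model(x_t, t, x0_prev)  # condition on the previous x0 estimate
        x_t = posterior_step(x_t, x0_pred, t)
        x0_prev = x0_pred                 # reuse the estimate at the next step
    return x_t
```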

Moreover, it is straightforward to design a two-pass-sampling-aware algorithm, i.e., the first pass generates an intermediate result (optionally with stop-gradient, optionally using a frozen model, and optionally using a partial forward process/masking plus one-step denoising/reconstruction), and the second pass uses the encoded intermediate result as the condition. Of course, this naive design might be inefficient to train. Fortunately, self-conditioning is fully compatible with two-pass sampling.
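
A minimal sketch of the naive two-pass baseline from item 2, assuming hypothetical `denoise(x, t_start)` (runs the reverse process starting from step t_start) and `add_noise(x0, t)` (the forward process) helpers, neither of which is from this repo:

```python
import torch

@torch.no_grad()
def two_pass_sample(denoise, add_noise, shape, num_steps, renoise_t):
    x = denoise(torch.randn(shape), t_start=num_steps - 1)  # pass 1: full denoising
    x = add_noise(x, t=renoise_t)                            # re-noise at a proper scale
    x = denoise(x, t_start=renoise_t)                        # pass 2: denoise again
    return x
```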

Would you like to include more comparison and discussion on these aspects?

Thank you for any help you can offer.

philippe-eecs commented 11 months ago

The paper is also missing a citation for DiffAE, which uses a similar method: it conditions the model on a representation, trains another diffusion model to sample this representation, and then samples the representation first before conditioning the diffusion model.

https://arxiv.org/abs/2111.15640

Otherwise, this is a really interesting paper! I think sampling from representation space will be an interesting direction for sampling diffusion models in the future.

LTH14 commented 11 months ago

Thanks for your interest and thanks for pointing out this work!

  1. The "self-conditioning" in Analog Bits is a bit different from the "self-conditioning" in RCG: Analog Bits conditions on a high-dimensional x0 image, while RCG conditions on a low-dimensional representation. The purpose of self-conditioning on a representation is to provide high-level guidance about the image to generate.
  2. The first sampling pass of RCG (RDM) is on low-dimensional representation space, which adds only marginal computational overhead to the entire generation process as shown in Table 7 of the paper. The goal of RDM is to generate a representation instead of an intermediate pixel generation result to condition on. Therefore, it does not contradict the two-pass sampling algorithm you propose: the two passes can both be conditioned on the low-dimensional representation and potentially improve the performance. It sounds very promising and we leave it for future exploration.
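
For reference, the two-stage sampling can be sketched roughly as follows (placeholder names and a placeholder representation dimension, not the actual modules in this repo):

```python
import torch

@torch.no_grad()
def rcg_sample(rdm, pixel_generator, batch_size, rep_dim=256):
    # pass 1: the representation diffusion model (RDM) samples a low-dimensional
    # representation; this pass is cheap because rep_dim is small
    rep = rdm.sample(torch.randn(batch_size, rep_dim))
    # pass 2: the pixel generator is conditioned on the sampled representation
    return pixel_generator.sample(cond=rep)
```
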
YouJiacheng commented 11 months ago

Thanks for your reply @LTH14. Self-conditioning in Analog Bits definitely differs from self-conditioning in RCG, but the insight is similar.

Moreover, as @philippe-eecs mentioned, DiffAE is almost identical, except that it doesn't emphasize the unconditional case, and its representation might have a higher dimension.


For two-pass sampling, I agree that sampling the representations has very low cost. Here I want to emphasize that two-pass generative models are intrinsically different from one-pass generative models, for example latent Dirichlet allocation versus a simple bag-of-words model. In addition, there is a large difference between a paradigm innovation and a cost optimization.

LTH14 commented 11 months ago

@philippe-eecs @YouJiacheng Thank you for pointing out the DiffAE work -- we will include it in our related works. Besides emphasizing unconditional image generation, one noticeable difference is that RCG uses an image encoder pre-trained with SSL methods (e.g. Moco v3) to encode the image into a low-dimensional latent space. Therefore, RCG's design allows seamless integration with different SSL image encoders and different pixel generators, as shown in Table 4(a) and Table 6(b).

@YouJiacheng I fully agree that multi-pass generative models are different from single-pass generative models (e.g., diffusion models are multi-pass, compared with GANs). Inference cost is always a downside of multi-pass generative models. However, we want to note that no prior work, multi-pass or single-pass, has demonstrated unconditional image generation performance that is competitive with conditional image generation. In this paper, we want to highlight our innovation in generating SSL representations, which proves to be an effective and efficient way to boost unconditional image generation performance, and can be seamlessly integrated with state-of-the-art SSL and image generation frameworks.

YouJiacheng commented 11 months ago

Agree. The most impressive part of RCG (to me) is that it can make unconditional generation quality comparable to conditional generation.

BTW, DiffAE itself is an SSL method IIUC. Thus the key difference is that RCG is compatible with different SSL methods and uses a (frozen?) pretrained encoder. (And it is not restricted to diffusion models.)

LTH14 commented 11 months ago

Yes, the major message we want to convey in the paper is bridging the gap between unconditional and conditional generation, so that generative models could benefit from pre-training on large-scale unlabeled image datasets. I feel the DiffAE method is a bit restricted by its design (mainly its shared image encoder and diffusion encoder). RCG is a more general framework that does not have such restrictions on image encoders or generative models.

YouJiacheng commented 11 months ago

It seems that DiffAE doesn't share the image (representation) encoder and the "diffusion encoder". Section 3.2 of DiffAE says:

We do not assume any particular architecture for this encoder; however, in our experiments, this encoder shares the same architecture as the first half of our UNet decoder

Here "UNet decoder" refers to its diffusion-based decoder. If there is a "diffusion encoder", it would be the first half of the UNet.

"Stochastic encoder" of DiffAE is merely to solve the ODE in the inverse direction, comparing to "diffusion decoder", with the same neural network. There is no counterpart of "stochastic encoder" in RCG framework.

Thus, only the architecture is shared, and this is only an experimental setting, not a constraint posed by the framework.

LTH14 commented 11 months ago

Thanks for the clarification. By "diffusion encoder" I mean the encoding part of the diffusion UNet. From my understanding, they co-train the semantic encoder with the diffusion UNet, which might not get good enough representations, and that's possibly why they do not get very good unconditional generation results.

miganchuanbo commented 11 months ago

I am new to this "self-conditioning". As @YouJiacheng mentioned, RCG uses two sampling passes (Representation Generator and Pixel Generator). What confuses me is: since RCG can map an image to its representation through the SSL encoder, why does it need another generator (RDM) to obtain the representation?

LTH14 commented 11 months ago

@miganchuanbo During generation, we don't have ground truth images to extract the representation. Therefore, we need to generate representations using an RDM.
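
To make the distinction concrete, a rough sketch with placeholder names (not the actual code in this repo): at training time the condition comes from the frozen SSL encoder applied to a real image, while at generation time it must be sampled from the RDM.

```python
import torch

def training_condition(ssl_encoder, image):
    # training: a ground-truth image exists, so its SSL representation is used
    # directly as the condition for the pixel generator
    with torch.no_grad():
        return ssl_encoder(image)

def generation_condition(rdm, batch_size, rep_dim=256):
    # generation: no ground-truth image exists yet, so the representation itself
    # must first be sampled from the representation diffusion model (RDM)
    return rdm.sample(torch.randn(batch_size, rep_dim))
```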

pixeli99 commented 11 months ago

hi, @LTH14, since I'm new to this field, I have a beginner's question. Can I understand unconditional generation to be the pipeline in the diagram below without the Rep. Dist.? Does that mean that if I first unconditionally sample a Rep. Dist. and then incorporate it into the pixel generation process, it would close the gap between unconditional and conditional generation? I'm not sure if this understanding is reasonable.

[figure: RCG pipeline diagram]
LTH14 commented 11 months ago

@pixeli99 Yes, that's how traditional unconditional generation (ADM, LDM, MAGE) is performed -- it does not condition on anything other than noise (either Gaussian noise or masking). The intuition behind RCG is exactly what you mentioned: we hope the image representation can play the role of human-annotated conditioning in guiding the pixel generator, which could bridge the gap between unconditional and conditional generation.

pixeli99 commented 11 months ago

Thank you very much for your response.

I have another question: are current unconditional image generation models unable to perform an implicit denoising of a Rep. Dist. internally before proceeding further to generate pixel images?

In other words, can't steps b and c be incorporated into a single black-box model? What I'm asking is whether it's possible for a model to internally achieve the effect of b+c. (I'm not sure if this question is silly, for instance, if it goes against some fundamental principle that I'm not aware of.)

[figure: pipeline diagram with steps labeled a, b, and c]
LTH14 commented 11 months ago

@pixeli99 This is a good question. I believe that all previous unconditional generation models kind of internally achieve the effect of b+c. In the literature, MAGE, compared with LDM and ADM, achieves much better unconditional image generation performance (without RDM). We believe the reason is that it can learn a much better representation, as shown in the MAGE paper.

However, in deep learning, it is common to achieve much better performance if we can explicitly design a module (e.g. ResNet explicitly designs the residual connection) instead of treating the neural network as an entire black box. From this observation, we decided to explicitly provide the pixel generator with a high-level representation to guide the generation, so that the pixel generator does not need to spend its capacity on both understanding the image and generating the pixels.

YouJiacheng commented 11 months ago

Hi @LTH14, I realized that RCG can be viewed as a natural (but nontrivial) generalization of Diffusion Autoencoders and Latent Diffusion Models.

The components of RCG, DiffAE and LDM are the same. They all consist of:

  1. a representation/latent encoder that defines the representation/latent space (and distribution);
  2. a generative model that models the representation/latent distribution (and generates representations/latents);
  3. a representation/latent-to-pixel/image conditional generator/decoder.

LDM uses a deterministic autoencoder architecture and a reconstruction objective ($\hat{x}=D(E(x))$) that couples and co-trains 1 and 3. DiffAE uses a stochastic autoencoder architecture and a reconstruction objective ($\hat{x}=D(\epsilon\sim\mathcal{N}, c=E(x))$) that couples and co-trains 1 and 3. RCG completely decouples 1 and 3, opening up a new way of designing generative models. This significant and innovative contribution enlarges the design space of generative models by adding a new dimension to it.
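
As an illustration of this decomposition (placeholder names, only to make the three components and their coupling explicit):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ThreeComponentGenerativeModel:
    encoder: Callable       # 1. defines the representation/latent space
    latent_model: Callable  # 2. models and samples the latent distribution
    decoder: Callable       # 3. maps a latent (plus noise) back to pixels

# LDM / DiffAE: `encoder` and `decoder` are coupled by a reconstruction loss and co-trained;
# the latent model is trained afterwards on the resulting latent space.
# RCG: `encoder` is a frozen, separately pre-trained SSL model, so components 1 and 3 are
# decoupled and each can be swapped independently.
```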

Interestingly, MAGE can be formulated as an instance of a variant of this framework. We can treat the VQGAN-Tokenizer plus the MAGE-Encoder as 1, "part" of the MAGE-Decoder as 2, and "part" of the MAGE-Decoder plus the VQGAN-Detokenizer as 3. Note that the MAGE-Decoder also models some distribution through random masking and de-masking. However, the MAGE-Decoder generates image tokens (i.e., in VQGAN's latent space) instead of encoded tokens (i.e., in the MAGE-Encoder's latent space). We can imagine that the MAGE-Decoder first reconstructs masked encoded tokens and then decodes them into image tokens, i.e., we can view the MAGE-Decoder as a MAGE-Reconstructor plus a MAGE-"Decoder".

The reconstruction-based representation learning might be the root cause of sub-optimal unconditional generation quality:

  1. The reconstruction objective naturally favors a symmetric architecture, since symmetry can improve reconstruction quality. However, improving reconstruction does not imply improving generation (explicitly pointed out and verified by MAGVIT-v2). Actually, the LDM paper mentions: "Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts". The ablation experiments in the RCG paper also hint that a symmetric architecture may hurt generation quality: Table 4(c) shows that a projection dimension of 768 leads to significantly worse generation performance than 256 or 128, while LDM-4 has a latent of shape 64*64*3 = 12288, which is significantly larger than 768, let alone 256 or 128.

  2. A representation learned from reconstruction can be largely suboptimal. There might exist a near-optimal reconstruction-based representation learning method that matches the optimal performance over all SSL methods (if reconstruction-based methods $A$ are a subset of all SSL methods $B$, then $\sup f(A) \leq \sup f(B)$, but equality might still hold for a given $f$); however, coupled representation learning and conditional pixel distribution modeling (a coupled encoder and decoder) make it hard, if not impossible, to search for such a near-optimal reconstruction-based setup. Table 4(a) shows that even within the contrastive learning family, the performance gap between different representation learning methods can be huge (I suspect that most MIM methods would perform worse).

LTH14 commented 11 months ago

@YouJiacheng Thanks for your insightful comments. Another difference between LDM and RCG is that LDM uses a deterministic decoder to decode the latent space, while RCG uses a pixel generator to generate pixels from the latent. The "latent" is a bit different in LDM and RCG: LDM's latent refers to 16x16 representation space, while RCG's latent refers to a representation vector without spatial dimension.

From my understanding, reconstruction-based representation learning is good at "fine-tuning" performance rather than "linear probing" performance, which means that the representation it learns is not very compact and regularized. We also tried some experiments with MAE as the image encoder, but the performance is not as good as with contrastive learning-based methods. One direction that might be interesting to explore is to use a contrastive+reconstructive pre-trained encoder, such as CMAE or MAGE-C -- maybe its features can provide even better guidance.

YouJiacheng commented 11 months ago

Totally agree, deterministic decoder versus stochastic generator can make a large difference. LDM uses a deterministic decoder, DiffAE uses a stochastic generator based on gaussian diffusion, and RCG(+MAGE) uses a stochastic generator based on random masking. BTW, LDM-4's latent is 64*64*3 instead of 16*16*c according to the original paper.

Thanks again for your very helpful discussion. At the beginning I largely underestimated the significance and novelty of this work. Now I have a much better understanding of RCG.

LTH14 commented 11 months ago

Thanks for the insightful discussion! We hope this paper could open the door to a more general design choice for generative models, which does not rely on labeled data and could be extended to other modalities. We've already updated the arXiv version according to our discussion and the new version should be out on Sunday.

kayson-tihoi commented 11 months ago

Are the compared methods randomly picked? :-) In the Papers with Code benchmark (see the attached screenshots), entries 2, 8 (ICCV23), 9 (NeurIPS23), and 22/23 use ViT backbones.

LTH14 commented 11 months ago

@kayson-tihoi We report representative and state-of-the-art methods in the literature for our comparison. Some of the papers ranked 1-9 are under peer review (1, 2, 3, 6, 7), so we include 4 in our comparison as the state of the art. In Table 1, we compare RCG with state-of-the-art class-conditional and class-unconditional methods without guidance, showing that RCG achieves superior performance. In Table 2, we compare RCG with representative (and also state-of-the-art) class-conditional methods with guidance, including MDT, DiT, ADM, and LDM, showing that RCG achieves comparable results.