LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
MIT License

What will happen if CLIP image representation is used to replace SSL representation? #34

Open tanbuzheng opened 3 months ago

tanbuzheng commented 3 months ago

Hi, author! Thanks for sharing! You've done impressive work! I have two questions. First, what would happen if a CLIP image representation were used to replace the SSL representation in the first two stages? Second, why not also adopt a diffusion model in the third stage? Compared with diffusion models, what are the advantages of using MAGE?

Looking forward to your reply!

LTH14 commented 3 months ago

Thanks for your interest! You can definitely use a CLIP image representation, or in general any representation, to replace the Moco v3 representation. In the paper, we mainly focus on the unconditional generation setting, where labels are not available. Therefore, we don't use CLIP in the paper, since its encoder is trained with text data, but it is definitely possible.
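To make the "any representation works" point concrete, here is a minimal sketch of the encoder-agnostic interface RCG's first stage assumes. The `toy_encoder` is a hypothetical stand-in; in practice you would load pretrained Moco v3 weights, or a CLIP image tower, and the L2 normalization mirrors how Moco v3 features are typically used.

```python
import torch
import torch.nn as nn

def extract_representations(encoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Encode images with a frozen pretrained encoder, then L2-normalize.

    RCG is agnostic to the encoder: the paper uses Moco v3, but a CLIP
    image tower (or any other pretrained encoder) could be dropped in here.
    """
    encoder.eval()
    with torch.no_grad():
        z = encoder(images)                    # (B, D) representation vectors
    return nn.functional.normalize(z, dim=-1)  # unit-norm features

# Toy stand-in encoder (a real run would load Moco v3 or CLIP weights).
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
images = torch.randn(4, 3, 32, 32)
reps = extract_representations(toy_encoder, images)
print(reps.shape)  # torch.Size([4, 256])
```

Because the downstream stages only ever see these vectors, swapping the encoder does not require changing the representation diffusion model or the pixel generator.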

The third stage can actually be any modern image generator. In Table 1 and Figure 2, we show that RCG significantly improves all of these generators, whether MAGE or diffusion models. One advantage of MAGE is that it achieves much better unconditional generation performance on its own (compared with diffusion models). Therefore, when combined with RCG, MAGE achieves the best unconditional generation performance among all competitors.
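The two-stage sampling described above can be sketched as follows. Both modules here are hypothetical one-step stand-ins: a real representation diffusion model (RDM) runs an iterative denoising loop, and the pixel generator would be MAGE, DiT, LDM, or ADM conditioned on the sampled representation.

```python
import torch
import torch.nn as nn

class ToyRDM(nn.Module):
    """Stand-in for the representation diffusion model: maps Gaussian
    noise to a representation vector in one step (a real RDM would run
    an iterative denoising chain)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, n: int) -> torch.Tensor:
        return self.net(torch.randn(n, self.net.in_features))

class ToyPixelGenerator(nn.Module):
    """Stand-in for stage 3: any representation-conditioned image
    generator (MAGE, DiT, LDM, ADM)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 16 * 16)
    def forward(self, rep: torch.Tensor) -> torch.Tensor:
        return self.net(rep).view(-1, 3, 16, 16)

rdm, gen = ToyRDM(), ToyPixelGenerator()
rep = rdm(2)      # stage A: sample a representation, no labels needed
imgs = gen(rep)   # stage B: decode pixels conditioned on that representation
print(imgs.shape)  # torch.Size([2, 3, 16, 16])
```

The key design point is that the pixel generator only depends on the sampled representation, so the third stage is freely interchangeable.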

tanbuzheng commented 3 months ago

Thanks for your reply! I have limited computing resources, only 1-2 3090 GPUs. Is it feasible to train the diffusion model at 256x256 resolution?
And if I just want to train MAGE on ImageNet-1k, how long would it take?

LTH14 commented 3 months ago

The representation diffusion model can be trained on a few GPUs. However, MAGE and the image diffusion models (DiT, LDM, ADM) need much more compute -- you can refer to Table 11 in the appendix for the specific training times.
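Part of why the representation diffusion model is so cheap is that it operates on low-dimensional vectors rather than pixels. Below is a hypothetical minimal training step illustrating the general shape of such a model: add noise to a frozen representation and regress the noise with a small MLP. The interpolation schedule and network here are illustrative assumptions, not the repo's actual RDM architecture.

```python
import torch
import torch.nn as nn

dim = 256
# Small MLP denoiser over representation vectors plus a noise-level input.
denoiser = nn.Sequential(nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim))
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

reps = torch.randn(32, dim)          # frozen SSL representations (toy data here)
t = torch.rand(32, 1)                # continuous noise level in [0, 1)
noise = torch.randn_like(reps)
noisy = (1 - t) * reps + t * noise   # simple linear noising schedule (assumption)

# Epsilon-prediction objective: recover the injected noise.
pred = denoiser(torch.cat([noisy, t], dim=-1))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
opt.step()
print(loss.item())
```

Since the denoiser is a small MLP over 256-dimensional vectors, a single training step like this is trivially cheap, which is why the representation stage fits on one or two GPUs while pixel-space generators do not.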