LTH14 / mage

A PyTorch implementation of MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Some questions about the VQGAN tokenizer #9

Closed LooperXX closed 1 year ago

LooperXX commented 1 year ago

First, congratulations on MAGE being accepted to CVPR 2023! I learned a lot from your great paper and also from your detailed replies to other issues!

I'm not familiar with the usage of the image tokenizer. Below are some questions from my side:

  1. Based on this issue, the image tokenizer was pre-trained under two settings that differ only in image augmentation, following the VQGAN tokenizer in MaskGIT. Can I therefore assume that the image tokenizer is first pre-trained and then plugged into the MAE-like encoder-decoder architecture?
  2. In your paper, only the reconstruction cross-entropy loss and the contrastive loss are used. Is the image tokenizer therefore fixed (frozen) while MAGE is trained from scratch?

Thanks again for such a great paper! It pushes the field a big step forward!

LTH14 commented 1 year ago

Hi! Thanks for your interest. The answers to both of your questions are yes:

  1. One key observation in MAGE is that exactly the same MAE-like encoder-decoder, operating on image tokens instead of pixel space (as in MAE), gives a huge boost in both generation and representation learning performance.
  2. The image tokenizer is pre-trained and fixed during MAGE training.
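
To make the flow concrete, here is a toy sketch of a frozen tokenizer that turns 256x256 images into a 16x16 grid of discrete token indices for MAGE to train on. `ToyTokenizer` and its `encode_to_ids` method are made up for illustration and do not match the repo's actual classes:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained VQGAN encoder + quantizer (not the repo's real class):
# it downsamples 256x256 images to a 16x16 feature map and snaps each feature
# vector to its nearest codebook entry, returning discrete token indices.
class ToyTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 256x256 -> 16x16
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def encode_to_ids(self, images):
        feats = self.encoder(images)                     # (B, dim, 16, 16)
        feats = feats.flatten(2).transpose(1, 2)         # (B, 256, dim)
        codes = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        dists = torch.cdist(feats, codes)                # (B, 256, codebook_size)
        return dists.argmin(dim=-1)                      # (B, 256) token indices

tokenizer = ToyTokenizer()
tokenizer.eval()
for p in tokenizer.parameters():                         # frozen during MAGE training
    p.requires_grad_(False)

images = torch.randn(8, 3, 256, 256)
token_ids = tokenizer.encode_to_ids(images)              # MAGE only ever sees these indices
print(token_ids.shape)                                   # torch.Size([8, 256])
```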

LooperXX commented 1 year ago

Thanks for your prompt reply! Wish you all the best! Before reading your paper, I was very curious why BEiT-like papers use the patch projection tokens as input instead of their tokenized semantic tokens. I am not sure whether MAGE is the first paper to use semantic tokens as input for representation learning, but the linear probing results prove its effectiveness!

I also really like the analysis and details in Table 6. Regarding your explanation in this issue, I am a little confused about why "the receptive fields of neighboring feature pixels have significant overlap, it is much easier to infer masked feature pixels using nearby unquantized feature pixels." Since the raw pixels are masked, shouldn't the receptive fields of the neighbouring feature tokens and semantic tokens be very similar?

LTH14 commented 1 year ago

I think there is some misunderstanding: we do not mask the raw pixels of the 256x256 images; we mask the pixels in the feature space (16x16). Therefore, if we do not quantize the features and directly mask them, the masked features can easily be inferred from nearby features, since each pixel in the feature space has an overlapping receptive field.
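
As a small illustration of this point (the token IDs, sizes, and the reserved `mask_token_id` below are all made up), masking operates on the 256 positions of the 16x16 token grid rather than on raw pixels:

```python
import torch

B, L = 8, 16 * 16                    # batch size, number of token positions (16x16 grid)
codebook_size = 1024
mask_token_id = codebook_size        # an extra index reserved for [MASK] (illustrative choice)

token_ids = torch.randint(0, codebook_size, (B, L))     # stand-in for VQGAN token indices

mask_ratio = 0.75
num_masked = int(mask_ratio * L)

# Pick a random subset of the 256 grid positions per image and replace them with [MASK].
noise = torch.rand(B, L)
masked_pos = noise.argsort(dim=1)[:, :num_masked]             # indices of masked positions
masked_ids = token_ids.scatter(1, masked_pos, mask_token_id)  # masking acts on tokens...

# ...so the model never sees an unquantized feature (with its overlapping receptive field)
# at a masked position; it must predict the discrete index from the visible tokens.
print(masked_ids.shape, (masked_ids == mask_token_id).float().mean())
```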

LooperXX commented 1 year ago

Thanks for resolving my confusion! The analysis in your paper, the discussion in this issue, and the previous issue remind me that directly masking CNN-encoded image features would cause information leakage through the overlapping receptive fields. Besides using quantized tokens as in MAGE, you could also try ViT-VQGAN, which states in its Section 3.1: "The encoder of ViT-VQGAN first maps 8×8 non-overlapping image patches into image tokens". Although it is still not open-source, it is the only Transformer-based tokenizer I know of, and the authors are from Google Research, so maybe you can try MAGE with ViT-VQGAN. Since the patches are non-overlapping, it may handle this problem better and improve the results in Table 4.
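
For reference, a non-overlapping, ViT-style patch embedding of the kind ViT-VQGAN's Section 3.1 describes can be sketched with a strided convolution whose kernel size equals its stride, so each token's receptive field is exactly one 8x8 patch. This is only an illustration, not ViT-VQGAN's actual code:

```python
import torch
import torch.nn as nn

# kernel_size == stride => patches do not overlap, so each output token's receptive
# field is exactly one 8x8 patch of the input image.
patch_embed = nn.Conv2d(3, 256, kernel_size=8, stride=8)

images = torch.randn(2, 3, 256, 256)
tokens = patch_embed(images)                  # (2, 256, 32, 32): one token per 8x8 patch
tokens = tokens.flatten(2).transpose(1, 2)    # (2, 1024, 256) token sequence for a ViT encoder
print(tokens.shape)
```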

LTH14 commented 1 year ago

Thanks for your suggestion! It could be an interesting future direction to explore.

LooperXX commented 1 year ago

Hi @LTH14, I noticed that in mage/models_mage.py, after tokenizing the input images into input ids (based on the VQGAN codebook), the ids are converted to embeddings via self.token_emb. However, based on your paper (if I am not mistaken):

  1. it seems that self.token_emb should be initialized with the VQGAN codebook?
  2. self.token_emb should be fixed during MAGE training? (Considering you said here that "The image tokenizer is pre-trained and fixed during the MAGE training.", does that mean the encoder, decoder, and codebook of the VQGAN are all fixed?)

LTH14 commented 1 year ago

Hi, I think there is a bit of a misunderstanding: we use the pre-trained VQGAN to extract the token index, not the token vector, from the original image. After that, we use a separate embedding (i.e., self.token_emb) to embed those token indices, and this embedding is trained together with the rest of MAGE. The main reason is that the embedding dimension of the MAGE transformer can differ from that of the VQGAN codebook, so directly using the VQGAN codebook to initialize self.token_emb could cause a dimension mismatch.
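
A minimal sketch of that arrangement, with made-up sizes: the VQGAN codebook (say 1024 entries of width 256) stays frozen and only supplies indices, while MAGE embeds those indices with its own trainable table whose width matches the transformer (say 768):

```python
import torch
import torch.nn as nn

codebook_size = 1024     # assumed VQGAN codebook size
vqgan_dim = 256          # assumed VQGAN embedding dimension
mage_dim = 768           # assumed MAGE transformer width (differs from vqgan_dim)

# Frozen VQGAN codebook: only used (inside the tokenizer) to produce token indices.
vqgan_codebook = nn.Embedding(codebook_size, vqgan_dim)
vqgan_codebook.weight.requires_grad_(False)

# MAGE's own embedding table over the same indices; randomly initialized and trained
# with the rest of MAGE. Its width is mage_dim, so the VQGAN codebook (vqgan_dim)
# could not be copied in directly without a projection.
token_emb = nn.Embedding(codebook_size, mage_dim)

token_ids = torch.randint(0, codebook_size, (8, 256))    # from the frozen tokenizer
x = token_emb(token_ids)                                  # (8, 256, 768) transformer input
print(x.shape)
```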

LooperXX commented 1 year ago

Thanks for your prompt reply! So the frozen VQGAN is just used to extract the image token indices and to decode those indices back to an image (tokenize and detokenize, so to speak). The randomly initialized self.token_emb is learned during training and can be seen as a newly learned MAGE codebook aligned with the VQGAN codebook (so MAGE can use the VQGAN decoder). Is there any misunderstanding in the above?🧐 BTW, did you try a copy of the VQGAN codebook plus a linear projection as self.token_emb?

LTH14 commented 1 year ago

You are mostly correct! Just one minor point: the newly learned MAGE codebook does not necessarily need to align with the VQGAN codebook. The output of MAGE is still a token index, so you can use the VQGAN detokenizer to decode those indices.

"VQGAN codebook copy and a linear projection as the self.token_emb" -- I remember I tried that one and the results are quite similar.

LooperXX commented 1 year ago

The "align" in my comment means each index in the token_emb naturally aligns with that in the VQGAN codebook based on the training process. It aligns with your comment! Thanks again for your detailed discussion!