LTH14 / mage

A PyTorch implementation of MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Question on image Inpainting #49

Open GGGGxxxxxxxxr opened 8 months ago

GGGGxxxxxxxxr commented 8 months ago

Dear authors,

I am very interested in your brilliant work, and I am curious how image inpainting is achieved within such a framework.

Unlike MAE, MAGE makes its predictions in the VQGAN codebook-index domain. I can see how inpainting would be implemented with MAE, but with MAGE, how do you feed a masked image into the VQGAN and then predict in the resulting codebook-index domain?

After all, a mask at the raw pixel level does not translate directly into a mask in the codebook-index domain.

Could you please explain this in more detail? Thank you so much again for your great contribution!

LTH14 commented 8 months ago

Thanks for your interest! For Figure 1 in the paper, as mentioned in the caption, "the mask for MAGE is on semantic tokens whereas that of MAE is on patches in the input image." For image inpainting / outpainting / uncropping, we first mask the original image pixels; then, after the VQGAN encoder, we mask out every image token affected by the pixel mask (any token whose patch overlaps the masked pixels) and perform the reconstruction in token space.
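To make the "any overlap" rule concrete, here is a minimal sketch of mapping a pixel-level mask to a token-level mask, assuming a 256x256 input and a VQGAN with stride 16 (so a 16x16 token grid). The function name and shapes are illustrative, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

def pixel_mask_to_token_mask(pixel_mask, stride=16):
    """Mark a token as masked if *any* pixel in its patch is masked.

    pixel_mask: (H, W) bool tensor, True where pixels are masked out.
    Returns an (H // stride, W // stride) bool tensor over the token grid.
    """
    # Max-pool over each stride x stride patch: a token counts as "affected"
    # as soon as a single one of its pixels falls under the pixel-level mask.
    m = pixel_mask.float().unsqueeze(0).unsqueeze(0)                 # (1, 1, H, W)
    token_mask = F.max_pool2d(m, kernel_size=stride, stride=stride)  # (1, 1, H/s, W/s)
    return token_mask.squeeze(0).squeeze(0).bool()

# Example: mask the right half of a 256x256 image (uncropping-style mask).
pixel_mask = torch.zeros(256, 256, dtype=torch.bool)
pixel_mask[:, 128:] = True
token_mask = pixel_mask_to_token_mask(pixel_mask)  # (16, 16), right half True
```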

GGGGxxxxxxxxr commented 8 months ago

I got your idea Boss.

But regarding what you mentioned, "mask out the image tokens that are largely affected by the pixel masking (any token that is overlapped with the pixel masking)": take, e.g., row 1 of Figure 1 in the MAGE paper, with a high-ratio random mask on that lovely dog. After several layers of downsampling before the codebook, most of the semantic tokens would likely be affected by the original pixel-level mask to some degree. How do you decide exactly which semantic tokens to keep for restoration?

Thanks again for your patience bro!

LTH14 commented 8 months ago

For Figure 1 in the paper, as mentioned in the caption, "the mask for MAGE is on semantic tokens whereas that of MAE is on patches in the input image."

For Figure 1's MAGE results, we DO NOT perform masking on pixels but only perform masking on tokens. Figure 1 compares MAGE with MAE on their reconstruction ability (the training task) instead of the inpainting ability. Figure 7 in the Appendix demonstrates the inpainting ability.

GGGGxxxxxxxxr commented 8 months ago

I got you.

So is that perhaps why you chose regular mask shapes such as squares, rather than arbitrary random masks, so that it is much easier to tell which part of the token matrix should be kept for encoding?

Thanks.

LTH14 commented 8 months ago

The mask in MAGE is always on tokens -- even when the original mask is on pixels (as in the inpainting scenario), we need to transfer it to a mask on tokens before running MAGE. During training, each token is masked out at random. Since the VQGAN has a stride of 16, each token roughly corresponds to a 16x16 patch of pixels in the original image, hence the square shapes.
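As a toy illustration of random masking on the token grid during training, the snippet below masks a fixed fraction of the 256 tokens per sample (for a 256x256 image and stride-16 VQGAN). The fixed ratio and helper name are simplifying assumptions for illustration; the actual masking-ratio schedule in the MAGE codebase differs.

```python
import torch

def random_token_mask(batch_size, num_tokens=256, mask_ratio=0.75, device="cpu"):
    """Randomly mask out `mask_ratio` of the token positions for each sample.

    Returns a (batch_size, num_tokens) bool tensor, True at masked positions.
    For a 256x256 image and a stride-16 VQGAN, num_tokens = (256 // 16) ** 2 = 256.
    """
    num_masked = int(num_tokens * mask_ratio)
    # Rank random noise per sample; the `num_masked` lowest-noise positions get masked.
    noise = torch.rand(batch_size, num_tokens, device=device)
    ids = torch.argsort(noise, dim=1)
    mask = torch.zeros(batch_size, num_tokens, device=device)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    return mask.bool()

mask = random_token_mask(batch_size=4)  # (4, 256), ~75% True per row
```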

GGGGxxxxxxxxr commented 8 months ago

Thanks for the detailed explanation! I am now much more confident in my understanding of this part. I appreciate your time and patience!