some question about pixel generator

Thanks for your interest. Please note that Fig. 3(b) illustrates the pixel generator's training phase. Most current generative frameworks, such as MAGE and LDM, either mask part of the original image or add noise to it, and train the model to reconstruct the original image. In Fig. 3(b), we take MAGE as an example: it first tokenizes the image into image tokens and then masks some of those tokens. Therefore, the original image is needed as input during training. However, we do not need the original image during generation: generation starts from a 100% masked image (MAGE) or from Gaussian noise (LDM/ADM), conditioned only on the representation.
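To make the difference concrete, here is a rough sketch of how the model input could be built in each phase for the MAGE-style case. It is not the code from this repo: `MASK_ID`, `make_training_input`, `make_generation_input`, the 16-token sequence, and the 0.5 mask ratio are all made up for illustration, and the random token ids only stand in for the tokenizer output.

```python
import random

MASK_ID = -1  # hypothetical id for the [MASK] token (assumption, not the repo's value)


def make_training_input(image_tokens, mask_ratio, rng):
    """Training: mask a random subset of the tokens of the ORIGINAL image."""
    n = len(image_tokens)
    num_masked = round(mask_ratio * n)
    masked_positions = set(rng.sample(range(n), num_masked))
    inputs = [MASK_ID if i in masked_positions else tok
              for i, tok in enumerate(image_tokens)]
    targets = list(image_tokens)  # the model is asked to reconstruct the original tokens
    return inputs, targets


def make_generation_input(num_tokens):
    """Generation: no original image is available, so every token is masked."""
    return [MASK_ID] * num_tokens


rng = random.Random(0)
# Stand-in for the token ids a tokenizer would produce from a real image.
image_tokens = [rng.randrange(1024) for _ in range(16)]

train_inputs, train_targets = make_training_input(image_tokens, 0.5, rng)
gen_inputs = make_generation_input(len(image_tokens))

# In both phases the pixel generator would additionally receive the
# representation as its condition; only the training input depends on the image.
print(train_inputs)
print(gen_inputs)
```

The only place the original image appears is `make_training_input`; `make_generation_input` needs nothing but the sequence length. For LDM/ADM the same idea would apply with Gaussian noise in place of the fully masked sequence.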

LTH14 / rcg