About Variational Autoencoder Pre-Training

YangLing0818 / SGDiff

Official implementation for "Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training" https://arxiv.org/abs/2211.11138


Also, I have a question about the VQVAE module. In latent diffusion model (LDM) training, the parameters of the VQVAE decoder (which decodes the latent vector into an image) are frozen and not updated. After the LDM denoises in latent space, the resulting latent vector has to pass through the VQVAE decoder to produce an RGB image. If the pre-trained VQVAE decoder is not fine-tuned on the Visual Genome (VG) dataset, what it has learned is still the latent-to-image mapping of the original OpenImages dataset, i.e., its raw pre-trained behavior. So I think this will limit the final image generation performance. Could you please explain this? Thanks a lot!
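To make the setup I am asking about concrete, here is a minimal PyTorch sketch of what I understand the training step to look like. It is not taken from the SGDiff code: `TinyAutoencoder`, `freeze`, and `denoiser` are hypothetical stand-ins I made up for illustration, and the noising step is simplified (no timestep schedule). It only shows that with a frozen first stage, the optimizer updates the denoiser while the decoder weights stay exactly as pre-trained.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in for the pre-trained VQVAE (not the SGDiff module)."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # 4x spatial downsampling into the latent space, and the inverse mapping.
        self.encoder = nn.Conv2d(3, latent_channels, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose2d(latent_channels, 3, kernel_size=4, stride=4)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def freeze(module: nn.Module) -> nn.Module:
    """Put a module in eval mode and disable gradients for all its parameters."""
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)
    return module


torch.manual_seed(0)
first_stage = freeze(TinyAutoencoder())               # frozen, as described in the question
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # stand-in for the diffusion U-Net

# Only the denoiser's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

images = torch.randn(2, 3, 32, 32)                    # placeholder batch, not real VG images
decoder_before = [p.clone() for p in first_stage.decoder.parameters()]

# One simplified training step in latent space.
with torch.no_grad():
    z = first_stage.encode(images)                    # shape (2, 4, 8, 8)
noise = torch.randn_like(z)
noisy_z = z + noise                                   # assumption: plain additive noise, no schedule
loss = nn.functional.mse_loss(denoiser(noisy_z), noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The decoder weights are bit-for-bit identical after the update.
decoder_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(decoder_before, first_stage.decoder.parameters())
)

# Sampling still goes through the same frozen decoder.
with torch.no_grad():
    z_hat = noisy_z - denoiser(noisy_z)               # one-step estimate of the clean latent
    sample = first_stage.decode(z_hat)                # shape (2, 3, 32, 32)
```

If the decoder were instead fine-tuned on VG, I would expect that to happen in a separate reconstruction-training stage for the autoencoder before the diffusion stage, with the result frozen again afterwards. That is my assumption about how it could be done, not something I found in the repository.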


About Variational Autoencoder Pre-Training #15