YangLing0818 / SGDiff

Official implementation for "Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training" https://arxiv.org/abs/2211.11138

About Variational Autoencoder Pre-Training #15

Open Qi-Chuan opened 7 months ago

Qi-Chuan commented 7 months ago

Dear authors, thanks for the excellent work! Regarding the VQVAE used to embed images into the latent space: you have provided the pre-trained model, but could you also share the training and testing code and instructions for the VQVAE part? In my understanding, the VQVAE determines the upper bound of the image generation quality, so having this code would be very helpful for training the whole model from scratch. Thanks a lot!
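To illustrate why the VQVAE bounds the final quality: the quantization step replaces each encoder output with its nearest codebook entry, and whatever is lost there can never be recovered by the decoder or the diffusion model. Below is a minimal toy sketch of nearest-neighbor vector quantization; the codebook size, dimensions, and random data are made up for illustration and are much smaller than in the actual SGDiff checkpoint.

```python
import numpy as np

# Toy nearest-neighbor vector quantization, as in a VQVAE bottleneck.
# Sizes here are illustrative only, not the real SGDiff configuration.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 codes, 4-dim embeddings
z_e = rng.normal(size=(16, 4))       # encoder outputs for 16 latent positions

# Quantize: replace each encoder vector with its nearest codebook entry.
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1)
z_q = codebook[indices]

# The quantization error is information the decoder can never recover,
# which is why the autoencoder caps the achievable generation quality.
mse = float(((z_e - z_q) ** 2).mean())
print(mse > 0.0)
```

Even a perfect latent diffusion model can only reproduce latents the decoder has seen codes for, so this reconstruction error is a hard floor on image fidelity.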

Qi-Chuan commented 7 months ago

Also, I have a question about the VQVAE module. I found that during latent diffusion model training, the parameters of the VQVAE decoder (which maps the latent vector back to an image) are frozen and not updated. After denoising in latent space by the LDM, the resulting latent vector is passed through the VQVAE decoder to produce an RGB image. If the pre-trained VQVAE decoder is not fine-tuned on the VG dataset, then what it has learned is still the latent-to-image mapping of the original OpenImages dataset, i.e., its raw pre-training behavior. I think this may limit the final image generation quality. Could you please explain this? Thanks a lot!