CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Why use VAE or VQ-GAN/VAE, not AE? #342

Open Luh1124 opened 8 months ago

Luh1124 commented 8 months ago

Why does the latent diffusion model use a variational autoencoder (VAE) or a similar generative model like VQ-GAN/VAE for compression, instead of a plain autoencoder (AE)? If an AE can be considered a one-to-one mapping for discrete images, wouldn't training in the AE's latent space be equivalent to training in pixel space? What role does the continuous or discrete latent distribution, which supports sampling in a VAE or VQ-GAN/VAE, actually play?
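
For concreteness, here is a minimal PyTorch sketch contrasting a plain AE bottleneck with a KL-regularized (VAE-style) one. This is an illustration only, not the repo's actual `AutoencoderKL`; the class names and shapes are made up for the example. The point is that the KL term is the only thing constraining the scale and smoothness of the latent space:

```python
import torch
import torch.nn as nn

class PlainAEBottleneck(nn.Module):
    # Deterministic bottleneck: the latent is whatever the encoder emits.
    # Nothing constrains the scale or geometry of the latent space.
    def forward(self, h):
        return h, torch.zeros((), device=h.device)  # latent, zero regularization loss

class KLBottleneck(nn.Module):
    # VAE-style bottleneck: the encoder output is split into mean/logvar,
    # a latent is sampled via the reparameterization trick, and a KL term
    # pulls q(z|x) toward N(0, I), keeping the latent space smooth and
    # roughly unit-scaled -- the space the diffusion model is trained in.
    def forward(self, h):
        mean, logvar = torch.chunk(h, 2, dim=1)
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        kl = 0.5 * torch.sum(mean.pow(2) + logvar.exp() - 1.0 - logvar,
                             dim=[1, 2, 3]).mean()
        return z, kl
```

(For reference, the LDM paper describes this as only a slight KL penalty toward a standard normal: the first stage is not meant to be a strong generative model on its own, just regularized enough that the latent distribution stays well-behaved for the diffusion model.)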

iris0329 commented 8 months ago

I think it is because the improved, larger variational autoencoder (VAE) with a KL loss for the inpainting task (the asymmetric VQGAN) was proposed in "Designing a Better Asymmetric VQGAN for StableDiffusion" by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, and Gang Hua, which was published later than this paper.

789as-syl commented 6 months ago

May I ask: does the latent diffusion model actually use a VAE or a VQ-GAN? In practice, which of the two has the advantage?

liutaocode commented 6 months ago

I am also searching for this answer. My guess is that the continuous space is smoother and more controllable than the VQ-based space.

=== Update === After reviewing the paper, here are my comments:

KL-regularized models may show better interpolation ability and greater diversity of generated samples in some tasks, but may require more careful tuning during training to balance reconstruction quality against the smoothness of the latent space. VQ-regularized models may exhibit higher stability and consistency in generative tasks, making them suitable for applications that demand high output quality, but this stability might come at the cost of some sample diversity.
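
For contrast with the KL sketch above, here is a minimal sketch of a VQ-regularized bottleneck: each spatial latent is snapped to its nearest codebook entry, with a straight-through estimator and a commitment loss. This is illustrative only; the codebook size, latent dim, and beta are placeholder values, and the repo's actual VQModel (from the taming-transformers codebase) differs in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    # VQ-style bottleneck: quantize each spatial latent to its nearest
    # codebook entry; gradients reach the encoder through a
    # straight-through estimator, and a commitment/codebook loss keeps
    # the encoder outputs and the codebook aligned.
    def __init__(self, num_codes=8192, dim=4, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):  # z: (B, C, H, W) with C == dim
        b, c, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(z_flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # commitment loss (encoder -> codebook) + codebook loss (codebook -> encoder)
        loss = self.beta * F.mse_loss(z_q.detach(), z) + F.mse_loss(z_q, z.detach())
        z_q = z + (z_q - z).detach()  # straight-through gradient for the encoder
        return z_q, loss
```

The discrete codebook bounds the latent's variance by construction, which is one reason VQ-regularized first stages tend to be stable, at the cost of a hard quantization step that can discard detail and diversity.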

Luh1124 commented 6 months ago

> May I ask: does the latent diffusion model actually use a VAE or a VQ-GAN? In practice, which of the two has the advantage?

The experimental results shown in the ldm README indicate that the VAE (KL-regularized) first stage can reach a lower FID with fewer steps.