CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Embedding dimensions of LDM-VQ models are different from VQGAN's #144

Open joanrod opened 2 years ago

joanrod commented 2 years ago

I noticed that the configuration of the VQ autoencoders in Latent Diffusion is different from the one used in VQGAN (taming-transformers). Specifically, embed_dim and z_channels have low values (3, 4, ...) in Latent Diffusion (https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/models/first_stage_models/vq-f8/config.yaml#L5), whereas in VQGAN the values were much larger (256, 512) (https://github.com/CompVis/taming-transformers/blob/24268930bf1dce879235a7fddd0b2355b84d7ea6/configs/imagenet_vqgan.yaml#L5).

TL;DR: what is the reason the z embedding dimension is lower in Latent Diffusion? Thanks!
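
For concreteness, here is a minimal sketch of the latent shapes implied by the two linked configs for a 256x256 input. The downsampling factors (f=8 for the LDM VQ model, f=16 for the ImageNet VQGAN) are inferred from the model name and the ch_mult settings, so treat them as assumptions rather than values stated in this thread:

```python
# Sketch: latent shape produced by a convolutional first-stage encoder,
# given its downsampling factor f and its z_channels setting.
def latent_shape(img_size: int, f: int, z_channels: int) -> tuple:
    return (z_channels, img_size // f, img_size // f)

# LDM vq-f8 config: z_channels = 4 (assumed f = 8)
print(latent_shape(256, 8, 4))      # (4, 32, 32)

# taming-transformers imagenet_vqgan.yaml: z_channels = 256 (assumed f = 16)
print(latent_shape(256, 16, 256))   # (256, 16, 16)
```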

zhuyu-cs commented 1 year ago

I have the same question! It seems very strange, and I can't figure out why such a small embedding dimension is also suitable for the mentioned tasks, especially the SR task. I'd just like to know whether this came out of empirical tuning or whether there is some deeper motivation.

wtliao commented 1 year ago

Hi @joanrod and @zhuyu-cs, I have the same confusion. Have you figured it out? Thanks!

ima9ine commented 1 year ago

@joanrod @zhuyu-cs @wtliao
The reason LDA has a much smaller embed_dim than VQGAN is that LDA uses a diffusion model as its generative process. Unlike VQGAN, which uses the encoded vectors as building blocks for its generator, LDA runs a diffusion process in the latent space, and that process can model the data with only a few channels. VQGAN's encoded vectors have a large channel dimension, but VQGAN also downsamples more than LDA, so its encoded output has far fewer spatial positions than LDA's. Since running a diffusion model at full image resolution takes a lot of computation, the main idea of LDA is to downsample the image only as far as the encoded latent stays perceptually equivalent to the original, and then run the diffusion model at that low resolution (i.e., on the embedded image). I don't know the exact reason why LDA uses embed_dim = 4 rather than 3, but I guess it is simply a model hyperparameter.

Check out this video: https://www.youtube.com/watch?v=844LY0vYQhc. You can see that the encoded images look similar to the original inputs, but they serve a different purpose for the generative model.

[Screenshots from the linked video]

wtliao commented 1 year ago

Hi @ima9ine,

Thanks a lot for your detailed explanation! It helps me understand better. One question: by "LDA", do you mean "LDM"? Also, for generating an image at a resolution of 1024x1024, is the input to the diffusion model, i.e. the output of the encoder, still of size 64x64x3? Thanks again!