Birdylx opened 6 months ago
For image generation, diffusion in a VQVAE latent space is not inferior to diffusion in a VAE latent space, so we believe a Video VQVAE is capable of working with video diffusion. In addition, the current codebase is mainly built on Latte, which uses an image VAE. If you look at any single frame it is reconstructed very well, but for video the differences between frames are magnified. We show this at 256×256.
https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/70fdb7ab-4cc6-4fb6-b928-45f6bb2b5efb
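To make the per-frame vs. frame-to-frame point concrete, here is a minimal sketch (assuming the diffusers `AutoencoderKL` used by Stable Diffusion, and a hypothetical random clip in place of real data) that encodes a video frame by frame with an image VAE and compares the per-frame reconstruction error against the error of the temporal differences, which is where the flicker shows up:

```python
# Minimal sketch: per-frame image-VAE reconstruction vs. temporal consistency.
# Assumes the diffusers AutoencoderKL from Stable Diffusion; the clip below is
# a hypothetical random tensor standing in for real video data.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

# Hypothetical clip: 16 frames, 3 channels, 256x256, values in [-1, 1].
video = torch.rand(16, 3, 256, 256) * 2 - 1

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # (16, 4, 32, 32)
    recon = vae.decode(latents).sample                # (16, 3, 256, 256)

# Per-frame reconstruction error can look small...
frame_mse = ((recon - video) ** 2).mean(dim=(1, 2, 3))

# ...but the frame-to-frame differences of the reconstruction need not match
# those of the source, and that mismatch is what gets magnified on playback.
orig_diff = video[1:] - video[:-1]
recon_diff = recon[1:] - recon[:-1]
temporal_mse = ((recon_diff - orig_diff) ** 2).mean(dim=(1, 2, 3))

print("mean per-frame MSE:     ", frame_mse.mean().item())
print("mean temporal-diff MSE: ", temporal_mse.mean().item())
```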
@LinB203 thanks for your quick reply. Yes, an image VAE like SD's AutoEncoderKL can reconstruct images very well, but it cannot ensure consistency between frames, which is why we use a Video VAE. My question is rather: why use such a large embedding dim in the latent? In LDM, the latent embedding dim is very small, typically `32 x 32 x 4`, i.e. `4` channels. Whether VQ or VAE, they usually use a small embedding dim. This project uses VQ with an embedding dim of `256` or `512` (not the spatial size), which is very large, and the diffusion model then has to predict noise of this high embedding dim. Is that reasonable?
I completely agree with you; in our testing, the quality of the image-based VAE is much higher than that of the Video-VQVAE. I think we urgently need to train a low-dimensional Video-AE, whether VAE or VQVAE.
Added this to the todo list.
Hi, this project uses a VQVAE to compress video into a small latent space, and the latent embedding dim is `512` or `256`. But in LDM they usually use a very small embedding dim such as `4` or `3`, and SD uses `4`. Will this large latent dim make the diffusion training process too hard to learn, since it has to predict such high-dimensional noise?
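For reference, a minimal sketch of why the codebook embedding dim ends up as the channel dim of the latent the diffusion model sees: each quantized position is replaced by its nearest codebook vector, so a codebook with 256-dim entries yields a latent with 256 channels. The codebook size, embedding dim, and encoder-output shape below are hypothetical, not taken from this repo's checkpoints:

```python
# Minimal VQ quantization sketch (hypothetical sizes): nearest-neighbour
# codebook lookup turns the encoder output into a latent whose channel dim
# equals the codebook embedding dim.
import torch
import torch.nn.functional as F

codebook_size, embed_dim = 1024, 256           # assumed example settings
codebook = torch.randn(codebook_size, embed_dim)

# Hypothetical encoder output for a short clip: (T', embed_dim, H', W').
z_e = torch.randn(4, embed_dim, 32, 32)

# Flatten spatial/temporal positions and find each one's closest codebook entry.
flat = z_e.permute(0, 2, 3, 1).reshape(-1, embed_dim)        # (N, 256)
dists = torch.cdist(flat, codebook)                          # (N, 1024)
indices = dists.argmin(dim=1)                                # (N,)
z_q = F.embedding(indices, codebook)                         # (N, 256)
z_q = z_q.view(4, 32, 32, embed_dim).permute(0, 3, 1, 2)     # (4, 256, 32, 32)

# The diffusion model would have to add/remove noise on this 256-channel tensor,
# versus the 4-channel latent of an SD-style KL-VAE at the same spatial size.
print("quantized latent shape:", tuple(z_q.shape))
```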