Birdylx opened 6 months ago
For image generation, diffusion in a VQVAE latent space is not inferior to diffusion in a VAE latent space, so we believe a Video VQVAE is capable of working with video diffusion. In addition, the current codebase is mainly built on Latte, which uses an image VAE. If you look at any single frame it is reconstructed very well, but for video the differences between frames are magnified. We show this at 256×256.
https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/70fdb7ab-4cc6-4fb6-b928-45f6bb2b5efb
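To make the per-frame vs. frame-to-frame point concrete, here is a minimal sketch (assuming the diffusers `AutoencoderKL` used by Stable Diffusion, and a hypothetical random clip in place of real data) that encodes a video frame by frame with an image VAE and compares the per-frame reconstruction error against the error of the temporal differences, which is where the flicker shows up:

```python
# Minimal sketch: per-frame image-VAE reconstruction vs. temporal consistency.
# Assumes the diffusers AutoencoderKL from Stable Diffusion; the clip below is
# a hypothetical random tensor standing in for real video data.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

# Hypothetical clip: 16 frames, 3 channels, 256x256, values in [-1, 1].
video = torch.rand(16, 3, 256, 256) * 2 - 1

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # (16, 4, 32, 32)
    recon = vae.decode(latents).sample                # (16, 3, 256, 256)

# Per-frame reconstruction error can look small...
frame_mse = ((recon - video) ** 2).mean(dim=(1, 2, 3))

# ...but the frame-to-frame differences of the reconstruction need not match
# those of the source, and that mismatch is what gets magnified on playback.
orig_diff = video[1:] - video[:-1]
recon_diff = recon[1:] - recon[:-1]
temporal_mse = ((recon_diff - orig_diff) ** 2).mean(dim=(1, 2, 3))

print("mean per-frame MSE:     ", frame_mse.mean().item())
print("mean temporal-diff MSE: ", temporal_mse.mean().item())
```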
@LinB203 thanks for your quick reply. Yes, an image VAE like SD's AutoEncoderKL can reconstruct images very well, but it cannot ensure consistency between frames, which is why we use a Video VAE. My question is rather: why use such a large embedding dim in the latent? In LDM, the latent embedding dim is very small, typically `32 x 32 x 4`, i.e. `4` channels. Whether VQ or VAE, they usually use a small embedding dim. This project uses VQ with an embedding dim of `256` or `512` (not the spatial size), which is very large, and the diffusion model then has to predict noise of this high embedding dim. Is that reasonable?
I completely agree with you; in our testing, the quality of the image-based VAE is much higher than that of the Video-VQVAE. I think we urgently need to train a low-dimensional Video-AE, whether VAE or VQVAE.
Added this to the todo list.
Hi, this project uses a VQVAE to compress video into a small latent space, and the latent embedding dim is `512` or `256`. But in LDM they usually use a very small embedding dim such as `4` or `3`, and SD uses `4`. Will this large latent dim make the diffusion training process too hard to learn, since it has to predict such high-dimensional noise?
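For reference, a minimal sketch of why the codebook embedding dim ends up as the channel dim of the latent the diffusion model sees: each quantized position is replaced by its nearest codebook vector, so a codebook with 256-dim entries yields a latent with 256 channels. The codebook size, embedding dim, and encoder-output shape below are hypothetical, not taken from this repo's checkpoints:

```python
# Minimal VQ quantization sketch (hypothetical sizes): nearest-neighbour
# codebook lookup turns the encoder output into a latent whose channel dim
# equals the codebook embedding dim.
import torch
import torch.nn.functional as F

codebook_size, embed_dim = 1024, 256           # assumed example settings
codebook = torch.randn(codebook_size, embed_dim)

# Hypothetical encoder output for a short clip: (T', embed_dim, H', W').
z_e = torch.randn(4, embed_dim, 32, 32)

# Flatten spatial/temporal positions and find each one's closest codebook entry.
flat = z_e.permute(0, 2, 3, 1).reshape(-1, embed_dim)        # (N, 256)
dists = torch.cdist(flat, codebook)                          # (N, 1024)
indices = dists.argmin(dim=1)                                # (N,)
z_q = F.embedding(indices, codebook)                         # (N, 256)
z_q = z_q.view(4, 32, 32, embed_dim).permute(0, 3, 1, 2)     # (4, 256, 32, 32)

# The diffusion model would have to add/remove noise on this 256-channel tensor,
# versus the 4-channel latent of an SD-style KL-VAE at the same spatial size.
print("quantized latent shape:", tuple(z_q.shape))
```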