Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.
Apache License 2.0

About video VAE #73

Open Darius-H opened 2 months ago

Darius-H commented 2 months ago

I found that the VAE merges the frame dimension into the batch dimension, which means there is no interaction between frames when the video latents are encoded. It behaves exactly like an image VAE, which does not seem to match Section 3.3.1 of the paper.

https://github.com/Vchitect/Latte/blob/c456dff74150e5b0db305fdd86b6f6de155c7634/train.py#L207

Is this because later experiments found that frame-to-frame interaction in the VAE does not improve video generation?
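
For readers following along, the behaviour described above can be reproduced with a small NumPy sketch. This is not the repository's code: `toy_frame_encoder` is a hypothetical stand-in for an image VAE encoder (plain 2x2 average pooling), and the shapes are chosen only for illustration. It shows that once frames are folded into the batch axis, changing one frame affects only that frame's latent.

```python
import numpy as np

def toy_frame_encoder(images):
    """Hypothetical stand-in for an image VAE encoder: 2x2 average pooling.

    images: (N, C, H, W) -> latents: (N, C, H//2, W//2)
    """
    n, c, h, w = images.shape
    return images.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def encode_video_framewise(video):
    """Fold frames into the batch axis, encode each frame, then unfold.

    video: (B, F, C, H, W) -> latents: (B, F, C, H//2, W//2)
    """
    b, f, c, h, w = video.shape
    flat = video.reshape(b * f, c, h, w)  # frames become independent batch items
    latents = toy_frame_encoder(flat)
    return latents.reshape(b, f, *latents.shape[1:])

rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 3, 8, 8))  # 2 videos, 4 frames each
latents = encode_video_framewise(video)

# Perturb only frame 1 of the first video and re-encode.
perturbed = video.copy()
perturbed[0, 1] += 1.0
latents_perturbed = encode_video_framewise(perturbed)

# changed[b, f] is True where the latent of frame f in video b moved.
diff = np.abs(latents_perturbed - latents).reshape(2, 4, -1)
changed = diff.max(axis=-1) > 0
```

Only `changed[0, 1]` ends up `True`: no other frame's latent depends on the perturbed frame, which is what "no interaction between frames" means here.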

maxin-cn commented 2 months ago

Hi, thanks for your interest. Section 3.3.1 does not describe compressing the video along the temporal dimension inside the VAE encoder. It describes compression along the temporal dimension applied to the latents of the video frames.
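
As a rough illustration of what temporal compression on already-encoded latents can look like, the NumPy sketch below groups consecutive latent frames into one token (a tubelet-style patching). This is an assumption made for illustration, not code from the repository or a confirmed reading of the paper; the function name `tubelet_tokens` and the parameters `t_stride` and `patch` are hypothetical.

```python
import numpy as np

def tubelet_tokens(latents, t_stride, patch):
    """Cut per-frame latents into tokens spanning `t_stride` frames.

    latents: (F, C, H, W)
    returns: (F//t_stride * H//patch * W//patch, t_stride * C * patch * patch)
    """
    f, c, h, w = latents.shape
    assert f % t_stride == 0 and h % patch == 0 and w % patch == 0
    x = latents.reshape(f // t_stride, t_stride, c,
                        h // patch, patch, w // patch, patch)
    # Put the token grid (time, row, col) first, then each token's contents.
    x = x.transpose(0, 3, 5, 1, 2, 4, 6)
    return x.reshape(-1, t_stride * c * patch * patch)

# 8 latent frames, 4 channels, 4x4 spatial, as if produced frame by frame.
latents = np.arange(8 * 4 * 4 * 4, dtype=np.float32).reshape(8, 4, 4, 4)

per_frame = tubelet_tokens(latents, t_stride=1, patch=2)   # no temporal compression
compressed = tubelet_tokens(latents, t_stride=2, patch=2)  # 2 frames per token
```

With `t_stride=2` the token count halves and each token carries values from two neighbouring latent frames, so the temporal mixing happens after the VAE, not inside it. In a real model a learned linear projection would typically follow this step.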