[Open] Darius-H opened this issue 2 months ago
I found that the VAE merges the frame dimension into the batch dimension, which means there is no interaction between frames when encoding video latents. It works exactly like an image VAE, which does not match Section 3.3.1 of the paper.
https://github.com/Vchitect/Latte/blob/c456dff74150e5b0db305fdd86b6f6de155c7634/train.py#L207
Is it because subsequent experiments found that frame-to-frame interactions do not improve video generation?
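For reference, here is a minimal sketch of the encoding pattern described above: the frame dimension is folded into the batch dimension, so each frame passes through the encoder independently with no cross-frame interaction. The function and tensor names are illustrative, not taken from the Latte code.

```python
import torch

def encode_frames_independently(video, encode_fn):
    # video: (B, F, C, H, W) -> fold frames into batch: (B*F, C, H, W)
    b, f, c, h, w = video.shape
    frames = video.reshape(b * f, c, h, w)
    # each frame is encoded on its own; no temporal interaction happens here
    latents = encode_fn(frames)
    # restore the frame dimension: (B, F, ...)
    return latents.reshape(b, f, *latents.shape[1:])

# toy stand-in "encoder" that downsamples spatially by 8, as a typical image VAE would
fake_encode = lambda x: x[:, :1, ::8, ::8]

video = torch.randn(2, 16, 3, 64, 64)
z = encode_frames_independently(video, fake_encode)
print(z.shape)  # torch.Size([2, 16, 1, 8, 8])
```

This is equivalent to running an image VAE over B*F independent images, which is the behavior the linked `train.py` line appears to implement.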
Hi, thanks for your interest. Section 3.3.1 does not refer to compressing the video along the temporal dimension at the VAE encoder stage. Instead, it refers to temporal compression applied to the latents of the video frames.
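To make the distinction concrete, here is a hypothetical sketch of temporal compression applied after per-frame encoding, i.e. on the latents rather than inside the VAE. The strided `Conv1d` is an illustrative stand-in only, not the operator used in the paper.

```python
import torch

# per-frame latents as produced by an image VAE: (B, F, C, H, W)
b, f, c, h, w = 2, 16, 4, 8, 8
latents = torch.randn(b, f, c, h, w)

# move channel and spatial dims into the feature axis, frames into the length axis
x = latents.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, f)

# compress along the temporal (frame) axis by a factor of 2
temporal_compress = torch.nn.Conv1d(c * h * w, c * h * w, kernel_size=2, stride=2)
z = temporal_compress(x)
print(z.shape)  # torch.Size([2, 256, 8]) -- 16 frames compressed to 8
```

The point is only where the compression happens: the VAE still encodes frames independently, and temporal mixing is done on the resulting latent sequence.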