AILab-CVC / CV-VAE

CV-VAE: A Compatible Video VAE for Latent Generative Video Models
https://ailab-cvc.github.io/cvvae/index.html

2D Enc + 3D Dec #2

Open ryancll opened 3 weeks ago

ryancll commented 3 weeks ago

https://github.com/AILab-CVC/CV-VAE/blob/7c69a0648a37778fa8121ab6b01a2a744449be8e/cvvae_inference_video.py#L32

Thank you for sharing the wonderful work!

According to your example code, the number of frames to encode in video mode must be 1 + 4*(N - 1), which means we sometimes have to drop several frames. Can we encode frames in image mode and decode the latents in video mode, so that all frames are kept and we get temporally interpolated outputs?
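For concreteness, here is a minimal sketch of the constraint (my own illustration, assuming only the 1 + 4*(N - 1) rule above, i.e. the clip length T must satisfy T % 4 == 1):

```python
def valid_video_length(num_frames: int) -> int:
    """Largest T <= num_frames with T = 1 + 4*(N - 1), i.e. T % 4 == 1."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return num_frames - (num_frames - 1) % 4

# Example: a 30-frame clip must be trimmed to 29 frames (29 = 1 + 4*7),
# dropping the last frame before video-mode encoding.
print(valid_video_length(30))  # 29
print(valid_video_length(29))  # 29
print(valid_video_length(4))   # 1
```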

sijeh commented 3 weeks ago


Apologies for the late response. We tried encoding the video in image mode (N frames of pixels -> N frames of latents) and decoding it in video mode (N frames of latents -> 1 + 4*(N-1) frames of pixels). Unfortunately, this does not yield a temporally interpolated output: image-mode encoding treats each frame independently, so the resulting latents carry no effective motion information, and the video decoded in video mode is not temporally smooth.

Interestingly, the 3D Decoder can achieve some degree of interpolation for small motions. Note, however, that this effect is extremely limited, so the 3D Decoder should not be treated as a frame interpolation model.
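For clarity, a hypothetical sketch of the experiment described above. The method names `encode_image` and `decode_video` are placeholders for illustration, not the actual CV-VAE API:

```python
import torch

def encode_2d_decode_3d(vae, video: torch.Tensor) -> torch.Tensor:
    """video: [B, C, T, H, W] pixels -> [B, C', 1 + 4*(T - 1), H', W'] pixels."""
    # Image-mode encode: each frame is encoded independently, so the
    # T latents carry no motion information across time.
    latents = torch.stack(
        [vae.encode_image(video[:, :, t]) for t in range(video.shape[2])],
        dim=2,
    )
    # Video-mode decode: N latents -> 1 + 4*(N - 1) pixel frames. Because
    # the latents were produced per-frame, the output is not temporally smooth.
    return vae.decode_video(latents)
```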