AILab-CVC / CV-VAE

[NeurIPS 2024] CV-VAE: A Compatible Video VAE for Latent Generative Video Models
https://ailab-cvc.github.io/cvvae/index.html

About training Resolution #7

Open lxd941213 opened 5 months ago

lxd941213 commented 5 months ago

Hi, great work! I would like to ask about some details of the CV-VAE training. I saw in your paper that CV-VAE is trained at "9 × 256 × 256 and 17 × 192 × 192". If it is trained at such a low resolution, will the quality be worse when inference is run at 512 or 768 resolution? Looking forward to your reply, thank you!
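Because these VAEs are fully convolutional, they can be run at resolutions other than the one they were trained at; whether the quality holds up is the empirical question discussed below. A minimal sketch of running an image VAE at 512×512, using the stock diffusers SD VAE only as a stand-in since the thread does not show CV-VAE's own loading code:

```python
# Sketch: run a fully convolutional image VAE at a resolution it was not
# trained at. Uses the stock SD VAE from diffusers as a stand-in; the
# CV-VAE checkpoint and loading code are not shown in this thread.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

with torch.no_grad():
    # 512x512 input even though the SD VAE was trained at 256x256;
    # convolutional encoder/decoder accept any resolution divisible by 8.
    x = torch.rand(1, 3, 512, 512) * 2 - 1       # images scaled to [-1, 1]
    z = vae.encode(x).latent_dist.sample()       # -> (1, 4, 64, 64), z=4 channels
    x_rec = vae.decode(z).sample                 # -> (1, 3, 512, 512)
    print(z.shape, x_rec.shape)
```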

ryancll commented 5 months ago

I've tested CV-VAE on high-resolution video data and the reconstruction quality is not as good as the 2D VAE, especially for high-frequency details like small human faces. @sijeh Do you have any plan to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experimental results on network capacity would be very instructive to the community.) Thank you!

Tord-Zhang commented 5 months ago

> I've tested CV-VAE on high-resolution video data and the reconstruction quality is not as good as the 2D VAE, especially for high-frequency details like small human faces. @sijeh Do you have any plan to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experimental results on network capacity would be very instructive to the community.) Thank you!

I have also tested CV-VAE and tried fine-tuning my UNet on it. While it keeps better temporal consistency, the details are rather worse compared to the 2D VAE.

sijeh commented 5 months ago

256×256 is sufficient for training a VAE, since the VAE of SD2.1 is also trained at this resolution. The loss of high-frequency information (such as fine textures and intense motion) is mainly due to the use of 4 channels in the latent (z=4). The 3D VAE has a higher compression ratio than the 2D VAE, resulting in greater information loss. We are also currently training the SD3 version of CV-VAE. Since SD3's latent uses 16 channels, it shows a significant improvement over the VAE with z=4 (with the same setting, 31.9 dB vs. 28.9 dB in PSNR, 0.928 vs. 0.885 in SSIM).
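For intuition on the capacity argument, here is a back-of-the-envelope calculation (my own sketch, not from the paper) of how many input values each latent value must represent, assuming the usual 8×8 spatial downsampling of the SD VAEs and an additional 4× temporal downsampling for the 3D video VAE:

```python
# Back-of-the-envelope compression ratios, assuming 8x8 spatial downsampling
# (as in the SD VAEs) and 4x temporal downsampling for the 3D video VAE.
def compression_ratio(spatial=8, temporal=1, in_channels=3, latent_channels=4):
    """Input values represented per latent value."""
    return (spatial * spatial * temporal * in_channels) / latent_channels

print(compression_ratio(temporal=1, latent_channels=4))   # 2D VAE, z=4  -> 48x
print(compression_ratio(temporal=4, latent_channels=4))   # 3D VAE, z=4  -> 192x (4x higher)
print(compression_ratio(temporal=4, latent_channels=16))  # 3D VAE, z=16 -> 48x again
```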

sijeh commented 5 months ago

> I've tested CV-VAE on high-resolution video data and the reconstruction quality is not as good as the 2D VAE, especially for high-frequency details like small human faces. @sijeh Do you have any plan to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experimental results on network capacity would be very instructive to the community.) Thank you!

Fine-tuning at higher resolutions cannot solve this problem. We have already tried further fine-tuning at 320×320×17, but the reconstruction performance cannot be effectively improved. The reconstruction loss mainly comes from the z=4 latent used in SD2.1's VAE, and the 3D VAE has a 4× higher information compression ratio than the 2D VAE. Using a z=16 3D VAE achieves a significant improvement.
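A small sketch of how reconstruction quality, such as the PSNR/SSIM figures quoted above, can be measured on your own clips. It assumes you already have original and VAE-reconstructed frames as uint8 arrays; the encode/decode step is model-specific and omitted here:

```python
# Sketch: per-clip PSNR/SSIM between original and reconstructed frames.
# Assumes `orig` and `recon` are lists of uint8 HxWx3 numpy arrays for the
# same clip; obtaining `recon` from the VAE is model-specific and omitted.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_psnr_ssim(orig, recon):
    psnr = [peak_signal_noise_ratio(o, r, data_range=255)
            for o, r in zip(orig, recon)]
    ssim = [structural_similarity(o, r, channel_axis=-1, data_range=255)
            for o, r in zip(orig, recon)]
    return float(np.mean(psnr)), float(np.mean(ssim))
```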

ryancll commented 5 months ago

@sijeh Thank you! Very useful information!

radna0 commented 4 months ago

Is it possible to get access to the z=16 SD3 version of CV-VAE? @sijeh