basilevh / gcd

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis (ECCV 2024 Oral) - Official Implementation
https://gcd.cs.columbia.edu/
GNU General Public License v3.0
153 stars · 2 forks

Frame Resolution and Autoencoder Generalization #4

Closed ttaosci closed 2 days ago

ttaosci commented 1 week ago

Hi, this is truly impressive work!

I just have a quick question. In your implementation, I noticed that the frame size is 384x256, which differs from the original SVD input size of 1024x576. Additionally, it seems that the autoencoder components (image encoder and video decoder) were not fine-tuned.

Do these two components from SVD generalize well to different resolutions? If so, are there any related papers exploring this, or have you run experiments on this and found it works well?

Thank you very much!

basilevh commented 1 week ago

Hi, thanks for your question! SVD by itself (image-to-video), without further training or fine-tuning, is not very robust to resolution changes; for example, strange artifacts appear when you run it at sizes much smaller or larger than 1024x576. However, once you start fine-tuning at any resolution (for example 384x256 in our case), the model adapts to that new mode very quickly, after a small number of training iterations, and the artifacts disappear. You're right that the VAE components are frozen, which to me intuitively suggests that the U-Net is primarily responsible for resolution-related issues. The VAE operation is more local/low-level, since it connects pixels to embeddings that are spatially close. Hope this helps!
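To illustrate the locality argument, here is a minimal sketch (a toy fully convolutional autoencoder, not the actual SVD VAE) showing why a conv-based encoder/decoder pair handles arbitrary resolutions with the same weights: every layer only looks at a small spatial neighborhood, so any input whose sides are divisible by the downsampling factor works without retraining.

```python
import torch
import torch.nn as nn

# Toy fully convolutional autoencoder (NOT the SVD VAE; purely an
# illustration). Three stride-2 convolutions give 8x spatial
# downsampling, mirroring the SVD VAE's factor of 8.
class TinyConvAE(nn.Module):
    def __init__(self, latent_ch=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = TinyConvAE()
# The same weights process both the SVD-native and the GCD resolution;
# latents and reconstructions simply scale with the input.
for h, w in [(576, 1024), (256, 384)]:
    x = torch.randn(1, 3, h, w)
    print(ae.encoder(x).shape, ae(x).shape)
```

The U-Net, by contrast, mixes information globally through attention, so its behavior is tied much more strongly to the token count (and hence the resolution) it was trained at.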

ttaosci commented 1 week ago

I see, so you’re suggesting that the artifacts that appear when changing the resolution of image-to-video SVD come from the U-Net component.

Have you tried using the image encoder to encode a video into latent frames and then reconstructing it with the video decoder at 384x256 resolution, without involving the U-Net? If your suggestion is correct, this round trip should reconstruct the video nearly perfectly!

Thank you again for your quick response—it’s been extremely helpful.