OpenDriveLab / Vista

[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
https://opendrivelab.com/Vista
Apache License 2.0

Channel-wise latent prior stronger than dynamic latent priors? #24

Closed jmonas closed 1 month ago

jmonas commented 3 months ago

Instead of repeating and concatenating the initial frame to each latent, did you try concatenating the final frame of the dynamic priors instead? It seems the channel-wise latent prior is much stronger than the replaced latent priors. Maybe this effect lessens with more training time?

Little-Podi commented 3 months ago

Any questions are welcome!

> Instead of repeating and concatenating the initial frame to each latent, did you try concatenating the final frame of the dynamic priors instead?

Yes, I tried what you suggested while developing this model, but I ultimately chose to use the first frame because we do not always have three frames available as conditions. For example, users may provide only one image as the starting frame, in which case the final frame is not available.
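To make the two schemes concrete, here is a rough PyTorch sketch of the difference between the channel-wise prior built from the first frame and the "replaced" dynamic prior built from the available condition frames. The shapes and names are only for illustration, not the exact code in this repo:

```python
import torch

def build_latent_priors(cond_latents: torch.Tensor, num_frames: int):
    """Illustrative comparison of the two conditioning schemes.

    cond_latents: (N, C, H, W) latents of the available condition frames
                  (N = 1 when only a single starting image is given).
    """
    # Channel-wise prior: repeat the first frame's latent over all frames,
    # to be concatenated with the noisy latents along the channel dimension.
    first = cond_latents[:1]                              # (1, C, H, W)
    channel_prior = first.expand(num_frames, -1, -1, -1)  # (num_frames, C, H, W)

    # "Replaced" dynamic prior: zeros everywhere except the first N positions,
    # which are overwritten by the condition latents themselves.
    replaced_prior = torch.zeros(
        num_frames, *cond_latents.shape[1:],
        dtype=cond_latents.dtype, device=cond_latents.device,
    )
    replaced_prior[: cond_latents.shape[0]] = cond_latents

    return channel_prior, replaced_prior

# The channel-wise prior would then be fused with the noisy latents, e.g.
#   model_input = torch.cat([noisy_latents, channel_prior], dim=1)
```

Since the channel-wise prior only ever uses the first frame, it works the same whether one or three condition frames are provided, which is why it was kept as the default.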

Also, if you want to dynamically set the final frame as the condition, I think it would be helpful to send the model a prompt indicating the index of the condition frame. However, condition frames with varying indices may lead to greater training instability than using a fixed frame.
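If you want to experiment with that, one hypothetical way to inject such a prompt (the class and its wiring below are only illustrative, not something implemented in Vista) is a learned embedding of the condition-frame index added to the existing conditioning vector:

```python
import torch
import torch.nn as nn

class CondIndexEmbedding(nn.Module):
    """Hypothetical module: embeds the index of the condition frame and
    adds it to an existing conditioning vector, so the denoiser is told
    which position in the clip the condition frame occupies."""

    def __init__(self, max_frames: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(max_frames, embed_dim)

    def forward(self, cond_vector: torch.Tensor, frame_index: torch.Tensor):
        # cond_vector: (B, embed_dim) existing conditioning vector
        # frame_index: (B,) long tensor with the condition frame's index
        return cond_vector + self.embed(frame_index)
```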

> It seems the channel-wise latent prior is much stronger than the replaced latent priors. Maybe this effect lessens with more training time?

This is because, without finetuning, the original SVD model has not learned how to incorporate the latent priors. Its ability to leverage these priors gradually improves over the course of training.