OpenDriveLab / Vista

[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
https://opendrivelab.com/Vista
Apache License 2.0

Channel-wise latent prior stronger than dynamic latent priors? #24

Closed jmonas closed 1 month ago

jmonas commented 3 months ago

Instead of repeating and concatenating the initial frame to each latent, did you try concatenating the final frame of the dynamic priors instead? It seems the channel-wise latent prior is much stronger than the replaced latent priors. Maybe this effect lessens with more training time?

Little-Podi commented 3 months ago

Any questions are welcome!

> Instead of repeating and concatenating the initial frame to each latent, did you try concatenating the final frame of the dynamic priors instead?

Yes, I tried what you suggested while developing this model, but I ultimately chose to use the first frame because we do not always have three frames available as conditions. For example, users may provide only one image as the starting frame, in which case the final frame is not available.
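To make the two schemes concrete, here is a rough PyTorch sketch of the difference between the channel-wise prior built from the first frame and the "replaced" dynamic prior built from the available condition frames. The shapes and names are only for illustration, not the exact code in this repo:

```python
import torch

def build_latent_priors(cond_latents: torch.Tensor, num_frames: int):
    """Illustrative comparison of the two conditioning schemes.

    cond_latents: (N, C, H, W) latents of the available condition frames
                  (N = 1 when only a single starting image is given).
    """
    # Channel-wise prior: repeat the first frame's latent over all frames,
    # to be concatenated with the noisy latents along the channel dimension.
    first = cond_latents[:1]                              # (1, C, H, W)
    channel_prior = first.expand(num_frames, -1, -1, -1)  # (num_frames, C, H, W)

    # "Replaced" dynamic prior: zeros everywhere except the first N positions,
    # which are overwritten by the condition latents themselves.
    replaced_prior = torch.zeros(
        num_frames, *cond_latents.shape[1:],
        dtype=cond_latents.dtype, device=cond_latents.device,
    )
    replaced_prior[: cond_latents.shape[0]] = cond_latents

    return channel_prior, replaced_prior

# The channel-wise prior would then be fused with the noisy latents, e.g.
#   model_input = torch.cat([noisy_latents, channel_prior], dim=1)
```

Since the channel-wise prior only ever uses the first frame, it works the same whether one or three condition frames are provided, which is why it was kept as the default.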

Also, if you want to dynamically set the final frame as the condition, I think it would be helpful to send the model a prompt indicating the index of the condition frame. However, condition frames with varying indices may lead to greater training instability than using a fixed frame.
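If you want to experiment with that, one hypothetical way to inject such a prompt (the class and its wiring below are only illustrative, not something implemented in Vista) is a learned embedding of the condition-frame index added to the existing conditioning vector:

```python
import torch
import torch.nn as nn

class CondIndexEmbedding(nn.Module):
    """Hypothetical module: embeds the index of the condition frame and
    adds it to an existing conditioning vector, so the denoiser is told
    which position in the clip the condition frame occupies."""

    def __init__(self, max_frames: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(max_frames, embed_dim)

    def forward(self, cond_vector: torch.Tensor, frame_index: torch.Tensor):
        # cond_vector: (B, embed_dim) existing conditioning vector
        # frame_index: (B,) long tensor with the condition frame's index
        return cond_vector + self.embed(frame_index)
```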

> It seems the channel-wise latent prior is much stronger than the replaced latent priors. Maybe this effect lessens with more training time?

This is because, without finetuning, the original SVD model has not learned how to incorporate the latent priors. Its ability to leverage these priors gradually improves over the course of training.