I notice your model works on BAIR and Cityscapes, but both datasets use 2 conditioning frames to predict 6 future frames, and your code only supports selecting 2 conditioning frames for the video prediction task.
Other common video prediction datasets such as KTH and SMMNIST typically use 5 or 10 conditioning frames, with the model predicting the next 5 or 10 frames. Can your model handle those settings? If I fork your code to run prediction on KTH/SMMNIST, could you suggest what code modifications would be needed? Thanks!
https://github.com/exisas/LGC-VD/blob/2c691d75a0de92f9609a50fd1f61f7d2d0fa62d2/video_diffusion_pytorch/video_diffusion_pytorch.py#L349-L360
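For reference, this is the kind of generalization I have in mind — a minimal, framework-agnostic sketch (the function name and return shape are my own, not taken from your repo) that builds a per-frame conditioning mask from configurable counts instead of hard-coding 2 conditioning / 6 predicted frames:

```python
def make_cond_mask(num_cond: int, num_pred: int) -> list:
    """Return a boolean mask over the frame axis: True marks a
    conditioning (observed) frame, False marks a frame to predict.

    Hypothetical helper -- the idea is that downstream masking logic
    would consume this instead of assuming exactly 2 conditioning frames.
    """
    if num_cond < 1 or num_pred < 1:
        raise ValueError("need at least one conditioning and one predicted frame")
    return [True] * num_cond + [False] * num_pred

# BAIR/Cityscapes-style setting: 2 conditioning frames, 6 predicted frames
print(make_cond_mask(2, 6))  # [True, True, False, False, False, False, False, False]

# KTH-style setting: 5 conditioning frames, 5 predicted frames
print(make_cond_mask(5, 5))
```

If the frame-selection logic in the linked lines could be driven by a mask like this (or by a `num_cond` config value), would the rest of the model work unchanged for the longer KTH/SMMNIST horizons?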