Official implementation of the paper "Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers" (CVPR 2023).
Can you provide FLOPs for training? Or an approximate training time with the corresponding number of GPUs?

We didn't measure FLOPs for training. For 128-frame training, SkyTimelapse took about 12 hours on 4 A100 GPUs, while Taichi and UCF-101 each took roughly two weeks on 8 A100 GPUs.
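If a rough number is still useful, one common way to approximate training FLOPs is to count forward-pass FLOPs with a tool such as fvcore and scale by the usual forward-plus-backward factor (~3x the forward pass). The sketch below is a minimal, hedged example: the network, input shape, and sizes are placeholders for illustration, not the actual model configuration in this repo, so the printed numbers are illustrative only.

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

# Placeholder network standing in for the actual video transformer
# (hypothetical architecture and sizes, for illustration only).
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)
model.eval()

# Hypothetical input: one sample of 1024 tokens with 512-dim embeddings.
dummy_input = torch.randn(1, 1024, 512)

# Count FLOPs for a single forward pass.
forward_flops = FlopCountAnalysis(model, dummy_input).total()

# Rough rule of thumb: one training step (forward + backward)
# costs roughly 3x the forward pass.
train_step_flops = 3 * forward_flops
print(f"forward FLOPs per sample:        {forward_flops:.3e}")
print(f"approx. FLOPs per training step: {train_step_flops:.3e}")
```

Multiplying the per-step estimate by the total number of training steps gives a ballpark figure for total training compute; actual utilization on the GPUs will of course be lower than the theoretical count.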