Hi, I am currently testing the Lumina-T2V model on a single A100 40GB. May I ask which GPU type was used to train the T2V model, and how many frames were used?
My implementation follows these steps:
1. Following the paper, I added extra flatten and unflatten operations along the frame dimension.
2. To save time, I ran the preprocessing (LLaMA text encoding and VAE encoding) separately before training. However, since the VAE is identical to the one used in T2I, I worry it may not capture enough temporal consistency.
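For reference, here is a minimal sketch of the flatten/unflatten I mean, assuming a `(b, f, c, h, w)` video latent; the function names are illustrative, not from the Lumina codebase:

```python
import torch

def flatten_frames(x: torch.Tensor) -> torch.Tensor:
    # (b, f, c, h, w) -> (b * f, c, h, w): treat each frame as an image
    # so the existing T2I spatial blocks can process it unchanged
    b, f, c, h, w = x.shape
    return x.reshape(b * f, c, h, w)

def unflatten_frames(x: torch.Tensor, b: int) -> torch.Tensor:
    # (b * f, c, h, w) -> (b, f, c, h, w): restore the frame dimension
    bf, c, h, w = x.shape
    return x.reshape(b, bf // b, c, h, w)

# Round trip with the shapes from my test below
x = torch.randn(4, 8, 4, 32, 32)
y = unflatten_frames(flatten_frames(x), b=4)
assert torch.equal(x, y)
```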
In my testing, the video tensor runs out of memory at b=4, f=8, c=4, h=32, w=32 (after embedding), so it seems nearly impossible to run even small-scale tests to verify your temporal-spatial merging method.
I am really interested in reading your training details, along with a comparison between the temporal-spatial dividing and merging strategies. Your insights would be greatly appreciated.