Hi, I am currently testing the Lumina-T2V model on a single A100 40GB. May I ask which GPU type was used to train the T2V model, and how many frames were used?
My implementation follows these steps:
1. Following the paper, I added extra flatten and unflatten operations along the frame dimension.
2. To save time, I ran the preprocessing (LLaMA text encoding and VAE encoding) separately before training. However, since the VAE is identical to the one used in T2I, I worry it may not capture enough temporal consistency.
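For reference, here is a minimal sketch of the flatten/unflatten I mean, assuming a `(b, f, c, h, w)` video latent; the function names are illustrative, not from the Lumina codebase:

```python
import torch

def flatten_frames(x: torch.Tensor) -> torch.Tensor:
    # (b, f, c, h, w) -> (b * f, c, h, w): treat each frame as an image
    # so the existing T2I spatial blocks can process it unchanged
    b, f, c, h, w = x.shape
    return x.reshape(b * f, c, h, w)

def unflatten_frames(x: torch.Tensor, b: int) -> torch.Tensor:
    # (b * f, c, h, w) -> (b, f, c, h, w): restore the frame dimension
    bf, c, h, w = x.shape
    return x.reshape(b, bf // b, c, h, w)

# Round trip with the shapes from my test below
x = torch.randn(4, 8, 4, 32, 32)
y = unflatten_frames(flatten_frames(x), b=4)
assert torch.equal(x, y)
```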
In my testing, the video tensor runs out of memory at b=4, f=8, c=4, h=32, w=32 (after embedding), so it seems nearly impossible to run even small-scale tests to verify your temporal-spatial merging method.
I am really interested in reading your training details, along with a comparison between the temporal-spatial dividing and merging strategies. Your insights would be greatly appreciated.