Alpha-VLLM / Lumina-T2X

Lumina-T2X is a unified framework for Text to Any Modality Generation
MIT License

Training details about the t2v model. #63

Open HashimotoPatrickMu opened 2 weeks ago

HashimotoPatrickMu commented 2 weeks ago

Hi, I am currently testing the Lumina-T2V model on a single A100 (40 GB). May I ask which GPU type was used to train the T2V model? I would also like to know how many frames were used.

My implementation follows these steps:

  1. Following the paper, I added flatten and unflatten operations along the frame dimension.
  2. To save time, I ran the preprocessing (LLaMA and VAE) separately before starting training. However, the VAE is identical to the one used in T2I, so I worry it might not capture enough temporal consistency.
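For reference, the flatten and unflatten step in item 1 could look roughly like the sketch below. It is only an illustration under assumptions, not the Lumina-T2X code: the function names are hypothetical, NumPy stands in for the training framework, and each latent position is treated as one token (no patching).

```python
import numpy as np

def flatten_video(x):
    """(b, f, c, h, w) -> (b, f*h*w, c): merge frame and spatial axes into one token axis."""
    b, f, c, h, w = x.shape
    # Move channels last so each (frame, row, col) position becomes one token of size c.
    tokens = x.transpose(0, 1, 3, 4, 2).reshape(b, f * h * w, c)
    return tokens, (f, h, w)

def unflatten_video(tokens, fhw):
    """Inverse of flatten_video: (b, f*h*w, c) -> (b, f, c, h, w)."""
    f, h, w = fhw
    b, n, c = tokens.shape
    assert n == f * h * w, "token count must equal f*h*w"
    return tokens.reshape(b, f, h, w, c).transpose(0, 1, 4, 2, 3)

# Hypothetical latent video with the shape mentioned in this issue.
video = np.random.rand(4, 8, 4, 32, 32).astype(np.float32)
tokens, fhw = flatten_video(video)
restored = unflatten_video(tokens, fhw)
```

With this layout, the transformer sees a single sequence of f*h*w tokens per sample, and the round trip recovers the original tensor exactly.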

In my testing, I hit an out-of-memory error once the video tensor reaches b=4, f=8, c=4, h=32, w=32 (after embedding). So it may be nearly impossible to run even small-scale tests to verify your temporal-spatial merging method.
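A rough back-of-the-envelope estimate shows why the merged sequence is expensive at this shape. The sketch below assumes things not stated in this issue: 16 attention heads, fp16 scores, one token per latent position, and a naive attention implementation that materializes the full score matrix (memory-efficient kernels such as FlashAttention avoid this). It compares one merged sequence of f*h*w tokens against separate spatial and temporal attention.

```python
def attn_scores_bytes(num_seqs, seq_len, heads=16, bytes_per_elem=2):
    """Bytes for one layer's attention score matrices, if fully materialized.

    heads and bytes_per_elem are assumed values, not taken from the model config.
    """
    return num_seqs * heads * seq_len * seq_len * bytes_per_elem

b, f, h, w = 4, 8, 32, 32  # shape reported in this issue

# Merged: each sample is one sequence of f*h*w tokens.
merged = attn_scores_bytes(b, f * h * w)
# Divided, spatial part: each frame attends within itself (h*w tokens).
spatial = attn_scores_bytes(b * f, h * w)
# Divided, temporal part: each spatial position attends across frames (f tokens).
temporal = attn_scores_bytes(b * h * w, f)

GiB = 1024 ** 3
print(f"merged:  {merged / GiB:.3f} GiB per layer")
print(f"divided: {(spatial + temporal) / GiB:.3f} GiB per layer")
```

Under these assumptions the merged case needs about 8 GiB per layer for attention scores alone, roughly 8x the divided case, which would be consistent with running out of memory on a 40 GB card.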

I am really interested in reading your training details and the comparison between the temporal-spatial dividing and merging strategies. Your insights would be very helpful.

BurhanUlTayyab commented 2 weeks ago

+1