Hugo-cell111 opened this issue 1 month ago
Hi, thanks for your interest. We train the motion adapter using all frames of the input videos, and we select a random frame to serve as the appearance guidance.
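To make the setup above concrete, here is a minimal PyTorch-style sketch of how a training batch could be assembled; this is not the repository's actual code, and the function and variable names are made up for illustration.

```python
import torch


def build_motion_training_batch(video_frames: torch.Tensor):
    """Illustrative sketch: `video_frames` is a (T, C, H, W) clip.

    The motion adapter is trained on the full clip, while one randomly
    chosen frame acts as the appearance guidance.
    """
    # All T frames are kept as the training target for the motion adapter.
    motion_input = video_frames

    # One random frame per iteration is used as the appearance condition.
    rand_idx = torch.randint(0, video_frames.shape[0], (1,)).item()
    appearance_guidance = video_frames[rand_idx]

    return motion_input, appearance_guidance
```

Because the motion adapter still sees every frame, only the appearance condition comes from a single randomly sampled frame.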
Thanks for your response! I also have a few other questions: (1) How long does each stage of DreamVideo take? I tried it on my own server and found that it takes about 2 hours for just the first stage of subject learning. Is that normal? I am using 4 V100 PCIe GPUs. (2) Could you provide a link to the open_clip_pytorch_model.bin used by FrozenOpenCLIPCustomEmbedder?
Hi. (1) We use one A100 80G GPU. It takes about 50 minutes for step 1 of subject learning and 10-15 minutes for step 2, so I think your timing is normal given the device differences. You can also reduce the number of training iterations to balance performance against time cost. (2) The 'open_clip_pytorch_model.bin' used in DreamVideo is the same one used by the other models (I2VGen-XL, HiGen, TF-T2V, etc.) in this repository. You can download the checkpoint from this link: https://modelscope.cn/api/v1/models/iic/tf-t2v/repo?Revision=master&FilePath=open_clip_pytorch_model.bin.
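If it helps, here is one possible way to fetch that checkpoint from the link above using only the Python standard library; the destination filename is just an example.

```python
import urllib.request

# URL taken from the reply above; the local filename is an arbitrary choice.
CKPT_URL = (
    "https://modelscope.cn/api/v1/models/iic/tf-t2v/repo"
    "?Revision=master&FilePath=open_clip_pytorch_model.bin"
)

urllib.request.urlretrieve(CKPT_URL, "open_clip_pytorch_model.bin")
print("Downloaded open_clip_pytorch_model.bin")
```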
Thank you very much! By the way, how long does evaluation take on all the datasets mentioned in the DreamVideo paper? Could you provide the evaluation code?
Hi! I noticed that each time only one frame of the guidance video is selected when training the motion adapter. Since selecting only one image breaks the temporal coherence of the video, I am wondering how the motion adapter can still capture the temporal motion pattern. Thanks!