MyNiuuu / MOFA-Video

Official PyTorch implementation for MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.
https://myniuuu.github.io/MOFA_Video

Question about max frame numbers #13

Closed SWAPv1 closed 3 days ago

SWAPv1 commented 1 week ago

Amazing project! What is the maximum number of frames it can generate?

MyNiuuu commented 1 week ago

For trajectory-based and hybrid control, since our approach is built on SVD, all results on our project page contain 25 frames. That said, we believe somewhat longer sequences should be achievable, provided sufficient GPU memory is available.

For landmark-based facial animation, we can achieve significantly longer videos via our periodic sampling strategy (refer to our paper for more details).
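For intuition, here is a minimal sketch of a sliding-window sampling loop in the spirit of such a strategy. `WINDOW`, `STRIDE`, and `generate_window` are assumed, illustrative names rather than the repository's actual code; the paper is the authoritative reference for the real method.

```python
# Minimal sketch of a sliding-window ("periodic") sampling loop, assuming each
# window is re-anchored on the input image while the control signal advances.
# WINDOW, STRIDE, and generate_window() are hypothetical, not MOFA-Video's API.
WINDOW = 25          # frames the SVD backbone produces per call
STRIDE = WINDOW - 1  # new frames contributed by each window

def generate_long_video(input_image, landmarks, total_frames):
    frames = []
    for start in range(0, total_frames, STRIDE):
        # Frame 0 of every window is the input image; the landmark segment
        # selects which part of the long sequence this window animates.
        window = generate_window(input_image, landmarks[start:start + WINDOW])
        frames.extend(window[1:])  # drop the repeated anchor frame
    return frames[:total_frames]
```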

YunjieYu commented 6 days ago

@MyNiuuu Hello, how do we adjust the maximum number of frames for hybrid control? In your demo's instructions, you mentioned that the current version of hybrid control only supports 25 frames. I simply changed `DragNUWA_net = Drag("cuda:0", target_size, target_size, 100)` at line 957 of `run_gradio_video_driven.py` to generate 100 frames, but the result is not good. Can you give me some tips on how to increase the maximum frame count while keeping good generation quality?

MyNiuuu commented 6 days ago

> @MyNiuuu Hello, how do we adjust the maximum number of frames for hybrid control? In your demo's instructions, you mentioned that the current version of hybrid control only supports 25 frames. I simply changed `DragNUWA_net = Drag("cuda:0", target_size, target_size, 100)` at line 957 of `run_gradio_video_driven.py` to generate 100 frames, but the result is not good. Can you give me some tips on how to increase the maximum frame count while keeping good generation quality?

Hi, I think simply changing the frame number at line 957 poses no problem from a code-logic perspective. I have actually tested the model on slightly longer videos, and the results seem alright. However, since both our MOFA-Adapter and SVD use temporal attention layers that were trained on short frame lengths, there is no guarantee the model will produce good results for significantly longer sequences.
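To see why sequence length matters for those temporal layers, here is a small illustrative sketch. The tensor shapes and the plain `MultiheadAttention` layer are assumptions for illustration, not MOFA-Video's actual architecture.

```python
import torch

# Illustrative only: temporal attention attends along the frame axis, so its
# weights were optimized for the ~25-frame sequences seen during training.
B, T, C, H, W = 1, 100, 320, 16, 16  # T=100 vs. the 25 used in training
attn = torch.nn.MultiheadAttention(C, num_heads=8, batch_first=True)

x = torch.randn(B, T, C, H, W)
tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)  # one sequence per spatial location
out, _ = attn(tokens, tokens, tokens)  # runs for any T without errors ...
# ... but nothing constrains the attention patterns to stay sensible far
# beyond the frame lengths the layer was trained on.
```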

yusuke-ai commented 1 day ago

@MyNiuuu Hi! From an architecture perspective, do you think it's possible to apply the periodic sampling strategy to hybrid control?

MyNiuuu commented 1 day ago

@yusuke-ai Hi!

Applying the periodic sampling strategy to hybrid control is theoretically feasible.

I have briefly tried periodic sampling with hybrid control, but the results show blurring and trembling artifacts in the trajectory-controlled part (the landmark-controlled part is fine, with no such issues).

The reason might be the following: for content generated with short inference (e.g., 25 frames) in trajectory-based control, there is only a gradual spatial transition from the first frame to the second. Under a periodic sampling strategy, however, the spatial gap between the first frame (the input image) and the second frame of the sliding window (not the second frame of the total sequence) grows as the window moves toward the end of the sequence. Such a large span is beyond the processing capability of the pre-trained SVD.
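To make the growing gap concrete, here is a quick back-of-the-envelope calculation; the window size and stride are assumed values consistent with the sketch earlier in this thread, not measured from the code.

```python
# Assumed window/stride values for illustration only.
WINDOW, STRIDE = 25, 24

for w in range(4):
    start = w * STRIDE
    print(f"window {w}: input image (frame 0) is paired with frames "
          f"{start}..{start + WINDOW - 1}; spatial gap ~ {start} frames of motion")
```

By the fourth window the anchor image and the window's content are roughly 72 frames of motion apart, which is the kind of large span the comment above says the pre-trained SVD cannot bridge.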