360CVGroup / FancyVideo

This is the official reproduction of FancyVideo.

[no issue] Congratulations, and a question: how to break the 64 / 61 frame limit? #9


ibrainventures commented 3 weeks ago

Hi, thank you very much for your work and for making this accessible to the world. I have tested many generations (mostly realistic) over the last 20 hours and I am very impressed by the results. The results from prompts dedicated to "people action" look absolutely realistic.

No morph-style artifacts, so it's great to see what can be squeezed out of the good old SD 1.5 on end-user hardware (okay, my vast.ai-rented 4090s are not typical end-user hardware :-) ).

I tried to understand your paper and saw the interpolation- and 3D UNet-related solutions / experiments.

Questions:

A) How would you estimate the possibility of > 10 sec (250+ frame) or longer generations?

B) If staying with the 64 / 61 frames, by offloading and imperceptible merging / stitching pipelines?

C) By wider-stepped interpolation, accepting less quality for longer "action"?

It would be great to get some feedback, and chapeau to the team! Great work!!

MaAo commented 3 weeks ago

Thank you for your attention and recognition of our work. Here are the answers to your questions:

A) To obtain more frames in a video, you have two options:

a) Using the current 61-frame video generation model, you can iteratively generate additional frames by using the end frame of the previous video as the reference for the next video (see the sketch after this list).

b) We will release models in the future with more frames, such as 125 or more. However, keep in mind that these models will require more memory for inference.
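
As an illustration of option a), here is a minimal sketch of chaining 61-frame generations, where the last frame of each clip becomes the reference image for the next run. The `generate_fn` callable and its parameter names are placeholders for illustration, not the actual FancyVideo API:

```python
def generate_long_video(generate_fn, prompt, first_frame,
                        num_segments=4, frames_per_segment=61):
    """Chain several 61-frame generations into one longer frame list.

    generate_fn is a placeholder for the actual inference call; its real
    name and signature may differ from what is assumed here.
    """
    frames = []
    reference = first_frame
    for i in range(num_segments):
        clip = generate_fn(prompt=prompt,
                           reference_image=reference,
                           num_frames=frames_per_segment)  # -> list of frames
        # Drop the first frame of later clips so the shared reference frame
        # does not appear twice at each seam.
        frames.extend(clip if i == 0 else clip[1:])
        reference = clip[-1]  # the end frame seeds the next segment
    return frames
```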

B) I didn't fully understand this question. If my response does not completely address your issue, please clarify further, and I will do my best to assist you.

C) We are currently using a Video VAE, which obviates the need for interpolation during model generation. For instance, with a latent space of (1, 4, 16, 64, 64), decoding with the Video VAE produces a video with dimensions of (1, 3, 61, 512, 512). The temporal dimension is computed as 4n - 3, where n represents the number of frames in the latent space. Our research indicates that the current Video VAEs are constrained by the number of channels, so we plan to train 16-channel Video VAEs and integrate them into our project in the future.
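
To make the temporal mapping concrete, here is a small sketch of the 4n - 3 relationship described above (the function names are just for illustration):

```python
def decoded_frames(latent_frames: int) -> int:
    # A latent with n temporal frames decodes to 4n - 3 video frames.
    return 4 * latent_frames - 3

def latent_frames_needed(video_frames: int) -> int:
    # Inverse mapping; only valid when (video_frames + 3) is divisible by 4.
    assert (video_frames + 3) % 4 == 0, "frame count must satisfy 4n - 3"
    return (video_frames + 3) // 4

print(decoded_frames(16))         # 61, as in the (1, 4, 16, 64, 64) example
print(latent_frames_needed(125))  # 32 latent frames for a future 125-frame model
```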