YBYBZhang / ControlVideo

[ICLR 2024] Official PyTorch implementation of "ControlVideo: Training-free Controllable Text-to-Video Generation"

Single Character #17

Open p0mad opened 1 year ago

p0mad commented 1 year ago

Hi, is it possible to generate a single character from a pose for more than 5 seconds?

I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate an output video with a length of 5 seconds that has a consistent character/avatar that dances, etc., following the controlled (pose) input?

Thanks
Best regards

YBYBZhang commented 1 year ago

Hi @p0mad, thanks for your attention! On my machine (11GB 2080 Ti), it is feasible to produce a consistent video conditioned on human pose with about 100 frames (i.e., 4~5 seconds at 24 fps), as shown in https://github.com/YBYBZhang/ControlVideo#long-video-generation.

p0mad commented 1 year ago

@YBYBZhang That's great. But have you initialized the pose with some input (a video or an image)?

I have a video of OpenPose + hands + face, and I want to generate a human-like animation (no matter what, just a consistent character/avatar). Sample Video

> human pose with about 100 frames (i.e., 4~5 seconds at 24 fps), which is shown in #long-video-generation.

The Hulk's size grows and the face/hair change during the generated video! Do you have any idea how to get a fixed-size, consistent character?

Thanks
Best regards

YBYBZhang commented 1 year ago

@p0mad The synthesized Hulk video is initialized with the poses below. Currently, our ControlVideo ensures video consistency with fully cross-frame attention only. In the future, adding temporal attention by fine-tuning on sufficient videos may improve size and character consistency! https://github.com/YBYBZhang/ControlVideo/assets/40799060/21b53efe-2167-4f74-afc2-3bec021acf20
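
For intuition, here is a minimal PyTorch sketch of what fully cross-frame attention means (a toy illustration of the idea, not the exact code in this repo): every frame's queries attend to the keys and values of all frames, which is what keeps appearance consistent across the clip.

import torch
import torch.nn.functional as F

def fully_cross_frame_attention(q, k, v):
    # q, k, v: (frames, tokens, dim) projections from each frame's self-attention.
    # Every frame attends to the concatenated keys/values of ALL frames,
    # so appearance information is shared over the whole clip.
    f, n, d = k.shape
    k_all = k.reshape(1, f * n, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * n, d).expand(f, -1, -1)
    return F.scaled_dot_product_attention(q, k_all, v_all)

# toy usage: 8 frames, 64 tokens per frame, 320-dim features
q = k = v = torch.randn(8, 64, 320)
out = fully_cross_frame_attention(q, k, v)   # shape: (8, 64, 320)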

p0mad commented 1 year ago

@YBYBZhang Thanks for the detailed information. Would you please also give me some insight/guidance on the hands + face part of the pose? Is there any model that I can use? (I see that ControlNet has Full-OpenPose.) But as I tested in the HF space, it won't take them into account! Is there any reason? (bad output)

Also, would you please provide some prompts that output a consistent character for the provided pose (e.g., a boy playing something, with a black background and animation style), so that I get a consistent character able to dance with correctly generated face and hands?

This was my best attempt at generating from the pose! final_result (2)

Thanks
Best regards

YBYBZhang commented 1 year ago

Full-OpenPose ControlNet is trained on Stable Diffusion v1.5, and thus inherits its limitations in producing low-quality hands and faces. I have tried to produce a video using ControlVideo (ControlNet v1.1, full-openpose) with a simple prompt, "A man, animation style." As shown below, the synthesized video looks more consistent than that from vanilla ControlNet. I hope this helps you.

https://github.com/YBYBZhang/ControlVideo/assets/40799060/31fc2127-b296-4727-b161-700aade31d0b
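
For reference, the per-frame "vanilla ControlNet" baseline mentioned above can be sketched with diffusers roughly as follows (the model IDs and the pose_maps list are illustrative assumptions, not the exact script used here). Because each frame is denoised independently, the result tends to flicker, which is what ControlVideo's cross-frame attention avoids:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

frames = []
for pose in pose_maps:  # pose_maps: a list of PIL pose images (assumed to exist)
    out = pipe("A man, animation style.", image=pose,
               num_inference_steps=50,
               generator=torch.Generator("cuda").manual_seed(42))  # same seed per frame
    frames.append(out.images[0])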

p0mad commented 1 year ago

@YBYBZhang Thank you so much for your time. Would you please guide me through the steps of generating this video?

Did you install ControlNet 1.1, download the OpenPose-full weights, select openpose-full, put "A man, animation style." in the prompt box, input the pose video (or did you use batch?), and then generate without any other input? How about the seed and steps?

Are there any other ways to improve hand and face accuracy? For example, using OpenPifPaf as mentioned in the ControlNet paper (which is on SD 2.1)? Or an SD 2.1 / SD-XL version of OpenPose-full?

Also, would you please let me know your GPU, memory, and CPU?

Thanks
Best regards

YBYBZhang commented 1 year ago

With a 2080 Ti 11GB GPU, I use the following script to produce the above video:

python inference.py \
    --prompt "A man, animation style." \
    --condition "openpose" \
    --video_path "data/pose1.mp4" \
    --output_path "outputs/" \
    --video_length 55 \
    --smoother_steps 19 20 \
    --width 512 \
    --height 512 \
    --frame_rate 2 \
    --version v11 \
    --is_long_video

where pose1.mp4 is center-cropped from your pose video.
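
If useful, such a center crop can be done with a short snippet, a rough sketch assuming OpenCV (any equivalent tool works):

import cv2

def center_crop_video(src, dst, size=512):
    # Center-crop each frame to a square and resize it to size x size.
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, (size, size))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        s = min(h, w)
        y0, x0 = (h - s) // 2, (w - s) // 2
        writer.write(cv2.resize(frame[y0:y0 + s, x0:x0 + s], (size, size)))
    cap.release()
    writer.release()

center_crop_video("pose.mp4", "data/pose1.mp4")  # hypothetical input path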

I haven't explored using newer SD or ControlNet versions to enhance hands and faces, but I believe they could achieve this goal.

p0mad commented 1 year ago

@YBYBZhang, Thank you so much

I have five questions:

> --condition "openpose"

  1. As it is ControlNet 1.1, do you think the output with "openpose_full" could lead to better results? To activate hand + face, as mentioned in the HF CN 1.1 demo, we need to set hand_and_face=True, but I couldn't find such a thing in your repo.


> --video_length 55 \

  2. I was wondering why you chose 55 as the video length.

The original pose: Sample Video (the attached image shows a length of 1.8 s),

while the output video: yours(output)

image

The result is confusing to me (it shows 2.7 s)!

Have you changed the input (pose) to 30 fps and center-cropped it? Can you please send the cropped version of my pose?

Are we able to generate output at 24 or 30 fps instead of 20? (Is that the --smoother_steps option?)

> but I believe that they could achieve this goal

  3. Can I use SD 2.1 with ControlNet 1.1 and the same OpenPose weights, or does it need to be trained on SD 2.1? (As you mentioned, the default is SD 1.5 with CN 1.1.)


  4. Would it be possible to also input a random image (a desired character) as an initial character to the SD + CN? Example image: image

  5. Is it possible to set the number of steps for each frame in the UNet?

Thanks again
Best regards

YBYBZhang commented 1 year ago
  1. "openpose" and "openpose_full" shares the same type of ControlNet. The given video is poses with hand and face landmarks, so I directly input it into ControlVideo.
  2. The video has a length of about 110 frames, so I chose 55 for efficient generation. The input and output fps are different, and you can set the output fps in this line. You can crop the video directly on this website.
  3. The open-sourced ControlNet is trained on SD v1.5. If you want an SD v2-based ControlNet, you must retrain it.
  4. With both "shuffle ControlNet" and "openpose ControlNet", this goal might be achieved.
  5. Maybe possible, but there is no corresponding implementation as far as I know.
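
As a concrete example for point 1, pose maps with hand and face landmarks (what hand_and_face=True toggles in the HF ControlNet v1.1 demo) can be extracted with the controlnet_aux package roughly as follows; the model ID and file paths here are illustrative assumptions, not part of this repo:

from controlnet_aux import OpenposeDetector
from PIL import Image

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
frame = Image.open("frame_000.png")  # one RGB video frame (hypothetical path)
# include_hand / include_face add hand and face landmarks, i.e. "full" OpenPose
pose_map = detector(frame, include_body=True, include_hand=True, include_face=True)
pose_map.save("pose_000.png")        # these maps are what the openpose condition consumes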