Official Implementation of "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models"
Similar to ControlNet, we obtain the condition maps from another video, and we currently support three kinds of control maps.
depth control | canny control | hed control |
---|---|---|
a bear walking through stars, artstation |
a dog, comicbook style |
person riding horse, watercolor |
The model has been tested with torch version 1.13.1+cu117. To set up the environment, simply run:
pip3 install -r requirements.txt
We provide a demo for quick testing in this repo. Simply run:
python3 inference.py --prompt "a bear walking through stars, artstation" --input_video bear.mp4 --control_mode depth
Args:
--input_video: path to the input video (mp4 format).
--num_sample_frames: number of frames to generate (> 8 is recommended).
--each_sample_frame: number of frames sampled in each pass (for auto-regressive generation).
--sampling_rate: interval used to skip frames when sampling from the input video.
--control_mode: type of control to apply; canny, depth, and hed are currently supported. (For hed, you need to download the weights of the HED annotator from link and put them in the workspace.)
--video_scale: guidance scale for video consistency, borrowed from GEN-1. (Avoid large values; 1~2 works well, and 0 disables it.)
--init_noise_thres: the proposed threshold for residual-based noise initialization. (Ranges from 0 to 1; larger values give smoother results but may introduce artifacts.)
--inference_step, --guidance_scale, --height, --width, --prompt: same as in other T2I models.
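To make the interaction between --sampling_rate and --num_sample_frames more concrete, the sketch below shows one plausible way frame indices could be picked from the input video. The helper name `select_frame_indices` and the exact slicing are assumptions for illustration only, not the repository's actual code.

```python
def select_frame_indices(total_frames, num_sample_frames, sampling_rate):
    """Pick every `sampling_rate`-th frame, keeping at most `num_sample_frames`.

    Hypothetical helper: it only illustrates the idea of skip-sampling.
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1")
    # Candidate indices: 0, sampling_rate, 2 * sampling_rate, ...
    indices = list(range(0, total_frames, sampling_rate))
    # Keep only as many frames as were requested.
    return indices[:num_sample_frames]


# A 60-frame clip, taking every 3rd frame and keeping 8 frames.
print(select_frame_indices(60, 8, 3))  # [0, 3, 6, 9, 12, 15, 18, 21]
```

Under this reading, a larger sampling rate covers a longer stretch of the input video with the same number of generated frames, at the cost of larger motion between consecutive frames.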
If automatic downloading does not work, the model weights can be downloaded from: depth_control_model, canny_control_model, hed_control_model.
Our model first generates the first frame. Once we have the first frame, we generate the subsequent frames conditioned on it. This allows our model to generate longer videos auto-regressively. (This operation is still experimental and may collapse after 3 or 4 iterations.)
python3 inference.py --prompt "a bear walking through stars, artstation" --input_video bear.mp4 --control_mode depth --num_sample_frames 16 --each_sample_frame 8
Note that num_sample_frames should be a multiple of each_sample_frame.
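The multiple-of constraint can be checked before launching a long run. The sketch below splits the requested frame count into equally sized auto-regressive passes; `plan_chunks` is a hypothetical helper written for this illustration, not a function from the repository.

```python
def plan_chunks(num_sample_frames, each_sample_frame):
    """Split the total frame count into equally sized auto-regressive passes.

    Hypothetical helper: returns (start, end) frame ranges, end exclusive.
    """
    if each_sample_frame <= 0:
        raise ValueError("each_sample_frame must be positive")
    if num_sample_frames % each_sample_frame != 0:
        raise ValueError(
            "num_sample_frames must be a multiple of each_sample_frame"
        )
    return [
        (start, start + each_sample_frame)
        for start in range(0, num_sample_frames, each_sample_frame)
    ]


# The example command above (16 frames, 8 per pass) gives two passes.
print(plan_chunks(16, 8))  # [(0, 8), (8, 16)]
```

The length of the returned list is the number of auto-regressive iterations, which is worth keeping small given the note that results may collapse after 3 or 4 iterations.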
Since we freeze the 2D model, you can replace it with any other model based on stable-diffusion-v1-5 to generate custom-style videos.
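import os
import torch
# pipeline_model_path and video_controlnet_pipe are assumed to be defined earlier in the script.
state_dict_path = os.path.join(pipeline_model_path, "unet", "diffusion_pytorch_model.bin")
state_dict = torch.load(state_dict_path, map_location="cpu")
video_controlnet_pipe.unet.load_2d_state_dict(state_dict=state_dict)  # reload 2d model.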
@misc{chen2023controlavideo,
title={Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models},
author={Weifeng Chen and Jie Wu and Pan Xie and Hefeng Wu and Jiashi Li and Xin Xia and Xuefeng Xiao and Liang Lin},
year={2023},
eprint={2305.13840},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This repository borrows heavily from Diffusers, ControlNet, and Tune-A-Video; thanks for open-sourcing! This work was done at ByteDance; thanks to our collaborators!