VideoTuna is a useful codebase for text-to-video applications.

- VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation (to the best of our knowledge).
- VideoTuna is the first repo that provides comprehensive pipelines for video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
- The models in VideoTuna cover both U-Net and DiT architectures for visual generation tasks.
- A new 3D video VAE and a controllable facial video generation model will be released soon.
- All-in-one framework: run inference and fine-tune up-to-date video generation models.
- Pre-training: build your own foundational text-to-video model.
- Continuous training: keep improving your model with new data.
- Domain-specific fine-tuning: adapt models to your specific scenario.
- Concept-specific fine-tuning: teach your models unique concepts.
- Enhanced language understanding: improve model comprehension through continuous training.
- Post-processing: enhance generated videos with a video-to-video enhancement model.
- Post-training / human preference alignment: post-train models with RLHF for more appealing results.
The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.
(Demo videos: Ground Truth vs. Reconstruction)
(Demo: controllable facial video generation from three input faces across six emotions: Anger, Disgust, Fear, Happy, Sad, Surprise)
VideoTuna/
├── assets       # images for the README
├── checkpoints  # model checkpoints go here
├── configs      # model and experiment configs
├── data         # data processing scripts and dataset files
├── docs         # documentation
├── eval         # evaluation scripts
├── inputs       # input examples for testing
├── scripts      # training and inference Python scripts
├── shscripts    # training and inference shell scripts
├── src          # model-related source code
├── tests        # testing scripts
└── tools        # utility scripts
T2V-Models | HxWxL | Checkpoints |
---|---|---|
CogVideoX-2B | 720x480, 6s | Hugging Face |
CogVideoX-5B | 720x480, 6s | Hugging Face |
Open-Sora 1.0 | 512x512x16 | Hugging Face |
Open-Sora 1.0 | 256x256x16 | Hugging Face |
Open-Sora 1.0 | 256x256x16 | Hugging Face |
VideoCrafter2 | 320x512x16 | Hugging Face |
VideoCrafter1 | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
I2V-Models | HxWxL | Checkpoints |
---|---|---|
CogVideoX-5B-I2V | 720x480, 6s | Hugging Face |
DynamiCrafter | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
Please check docs/CHECKPOINTS.md to download all the model checkpoints.
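If you prefer the command line, checkpoints hosted on Hugging Face can also be fetched with huggingface-cli. This is only a sketch: the repo ID and target directory below are illustrative, and docs/CHECKPOINTS.md remains the authoritative source for the exact files and folder layout.

```shell
# Illustrative only: check docs/CHECKPOINTS.md for the exact repo IDs, files, and layout.
pip install -U "huggingface_hub[cli]"

# Example: download a checkpoint repo into the expected folder under checkpoints/
huggingface-cli download VideoCrafter/VideoCrafter2 \
    --local-dir checkpoints/videocrafter/t2v_v2_512
```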
# create and activate the conda environment
conda create --name videotuna python=3.10 -y
conda activate videotuna
# install Python dependencies
pip install -r requirements.txt
# install SwissArmyTransformer from source
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
# install HPSv2 from source
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
# install ffmpeg via conda-forge
conda config --add channels conda-forge
conda install ffmpeg
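As an optional sanity check (assuming PyTorch is installed by requirements.txt and a CUDA-capable GPU is available), you can confirm that PyTorch and ffmpeg are ready:

```shell
# optional environment check
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
ffmpeg -version | head -n 1
```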
Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, place the model checkpoints according to the Checkpoint Structure described in docs/CHECKPOINTS.md.
To compare the supported models on the same prompts, run:

bash tools/video_comparison/compare.sh

By default, all supported methods are included: inference_methods="videocrafter2;dynamicrafter;cogvideo-t2v;cogvideo-i2v;opensora". To compare only a subset of models, edit the inference_methods variable in compare.sh and list the desired models separated by semicolons. Also set the input_dir variable: this directory should contain a prompts.txt file, where each line corresponds to one prompt for video generation. The default input_dir is inputs/t2v (see the example prompts file after the next command). To compare image-to-video models, run:
bash tools/video_comparison/compare_i2v.sh
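For reference, a minimal prompts file might look like the following. The directory name and prompts here are placeholders; point input_dir at whatever folder holds your own prompts.txt.

```shell
# create an example prompt list (hypothetical directory; the default input_dir is inputs/t2v)
mkdir -p inputs/my_prompts
cat > inputs/my_prompts/prompts.txt <<'EOF'
A corgi running on the beach at sunset
A time-lapse of clouds drifting over snow-capped mountains
EOF
```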
Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
---|---|---|---|---|---|---|
I2V | CogVideoX-5b-I2V | bash shscripts/inference_cogVideo_i2v_diffusers.sh | 49 | 576x1024 | 310.4 | 4.78 |
T2V | CogVideoX-2b | bash shscripts/inference_cogVideo_t2v_diffusers.sh | 49 | 576x1024 | 107.6 | 2.32 |
T2V | Open Sora V1.0 | bash shscripts/inference_opensora_v10_16x256x256.sh | 16 | 256x256 | 11.2 | 23.99 |
T2V | VideoCrafter-V2-320x512 | bash shscripts/inference_vc2_t2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
T2V | VideoCrafter-V1-576x1024 | bash shscripts/inference_vc1_t2v_576x1024.sh | 16 | 576x1024 | 91.4 | 14.57 |
I2V | DynamiCrafter | bash shscripts/inference_dc_i2v_576x1024.sh | 16 | 576x1024 | 101.7 | 52.23 |
I2V | VideoCrafter-V1 | bash shscripts/inference_vc1_i2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
T2I | Flux-dev | bash shscripts/inference_flux.sh | 1 | 768x1360 | 238.1 | 1.18 |
T2I | Flux-schnell | bash shscripts/inference_flux.sh | 1 | 768x1360 | 5.4 | 1.20 |
Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.
Before starting, we assume you have finished the following preliminary steps: 1) installed the environment, 2) prepared the dataset, and 3) downloaded the checkpoints, so that these two checkpoints are in place:
ll checkpoints/videocrafter/t2v_v2_512/model.ckpt
ll checkpoints/stablediffusion/v2-1_512-ema/model.ckpt
First, run this command to convert the VC2 checkpoint, since we make minor modifications to the keys of the checkpoint's state dict. The converted checkpoint will be automatically saved at checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt.
python tools/convert_checkpoint.py --input_path checkpoints/videocrafter/t2v_v2_512/model.ckpt
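Once the script finishes, the converted file should sit next to the original checkpoint; a quick way to confirm (the path comes from the note above):

```shell
# verify the converted checkpoint exists
ls -lh checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt
```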
Second, run this command to start training on a single GPU. The training results will be automatically saved at results/train/${CURRENT_TIME}_${EXPNAME}.
bash shscripts/train_videocrafter_v2.sh
We support LoRA fine-tuning so the model can learn new concepts/characters/styles; the corresponding config is configs/001_videocrafter2/vc2_t2v_lora.yaml. Fine-tune and then run inference with:
bash shscripts/train_videocrafter_lora.sh
bash shscripts/inference_vc2_t2v_320x512_lora.sh
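Putting the pieces together, a typical LoRA run looks like the sketch below. Which fields you edit in the YAML (for example, the training data path or the concept name) depends on your setup and is only an assumption here; the config file itself is the reference.

```shell
# 1) adjust the LoRA config (assumed fields: training data path, concept/trigger word)
#    configs/001_videocrafter2/vc2_t2v_lora.yaml
# 2) launch LoRA fine-tuning
bash shscripts/train_videocrafter_lora.sh
# 3) sample from the fine-tuned LoRA weights
bash shscripts/inference_vc2_t2v_320x512_lora.sh
```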
We support Open-Sora fine-tuning; simply run the following command:
# finetune the Open-Sora v1.0
bash shscripts/train_opensorav10.sh
We support VBench evaluation to evaluate the T2V generation performance. Please check eval/README.md for details.
We thank the following repos for sharing their awesome models and code!
Please follow the CC-BY-NC-ND license. If you need a license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).
@software{videotuna,
author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
month = {Nov},
year = {2024},
url = {https://github.com/VideoVerses/VideoTuna}
}