
The official repository for the paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation"
Project page: https://ali-videoai.github.io/tora_video
Apache License 2.0

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang (\* equal contribution)


💡 Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the given trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.
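For intuition only: the abstract describes the MGF as injecting motion patches into the DiT blocks. The toy NumPy sketch below illustrates one common way to inject a new condition into a transformer block, adaptive-normalization-style scale/shift modulation; all shapes, names, and the zero initialization here are illustrative assumptions, not the released implementation.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize hidden states over the channel dimension.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def motion_guidance_fuse(hidden, motion_patch, w_gamma, w_beta):
    # Predict per-channel scale/shift from the motion patch and
    # modulate the block's hidden states (adaptive-norm style).
    gamma = motion_patch @ w_gamma  # (tokens, channels)
    beta = motion_patch @ w_beta
    return (1.0 + gamma) * layer_norm(hidden) + beta

rng = np.random.default_rng(0)
tokens, channels, motion_dim = 16, 64, 32
hidden = rng.normal(size=(tokens, channels))
motion = rng.normal(size=(tokens, motion_dim))
# Zero-initialized projections leave the block's behavior unchanged at
# the start of training, a common choice when adding a new condition.
w_gamma = np.zeros((motion_dim, channels))
w_beta = np.zeros((motion_dim, channels))
fused = motion_guidance_fuse(hidden, motion, w_gamma, w_beta)
print(fused.shape)  # (16, 64)
```

With zero-initialized projections the fused output equals the plain normalized hidden states, so the motion branch only gradually takes effect as its weights are trained.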

📣 Updates

📑 Table of Contents

🎞️ Showcases

https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1

https://github.com/user-attachments/assets/7e7dbe87-a8ba-4710-afd0-9ef528ec329b

https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

All videos are available at this link

✅ TODO List

🐍 Installation

# Clone this repository.
git clone https://github.com/alibaba/Tora.git
cd Tora

# Install PyTorch (we use PyTorch 2.4.0) and torchvision following the official instructions: https://pytorch.org/get-started/previous-versions/. For example:
conda create -n tora python==3.10
conda activate tora
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install requirements
cd modules/SwissArmyTransformer
pip install -e .
cd ../../sat
pip install -r requirements.txt
cd ..

📦 Model Weights

Folder Structure

Tora
└── sat
    └── ckpts
        β”œβ”€β”€ t5-v1_1-xxl
        β”‚   β”œβ”€β”€ model-00001-of-00002.safetensors
        β”‚   └── ...
        β”œβ”€β”€ vae
        β”‚   └── 3d-vae.pt
        β”œβ”€β”€ tora
        β”‚   └── t2v
        β”‚       └── mp_rank_00_model_states.pt
        └── CogVideoX-5b-sat # for training stage 1
            └── mp_rank_00_model_states.pt
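The expected layout above can be sanity-checked before running inference or training. A small sketch, assuming the script is run from `Tora/sat` and using only the paths listed in the tree (the file list is taken from this README, not from the repo's own tooling):

```python
import os

# Expected checkpoint paths from the README, relative to Tora/sat.
EXPECTED = [
    "ckpts/t5-v1_1-xxl",
    "ckpts/vae/3d-vae.pt",
    "ckpts/tora/t2v/mp_rank_00_model_states.pt",
    "ckpts/CogVideoX-5b-sat/mp_rank_00_model_states.pt",  # training stage 1 only
]

def missing_weights(root="."):
    """Return the expected paths that are not present under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    for path in missing_weights():
        print("missing:", path)
```

The last entry is only needed for training stage 1, so a missing `CogVideoX-5b-sat` checkpoint is fine for inference-only setups.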

Download Links

Note: Downloading the Tora weights requires complying with the CogVideoX License. You can choose one of the following options: HuggingFace, ModelScope, or native links. After downloading the model weights, place them in the Tora/sat/ckpts folder.

HuggingFace

# This can be faster
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Le0jc/Tora --local-dir ckpts

or

# use git
git lfs install
git clone https://huggingface.co/Le0jc/Tora

ModelScope

# Use the ModelScope SDK
from modelscope import snapshot_download
model_dir = snapshot_download('xiaoche/Tora')

or

# use git
git clone https://www.modelscope.cn/xiaoche/Tora.git

Native

🔄 Inference

Inference requires around 30 GiB of GPU memory, tested on an NVIDIA A100.

cd sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt

You can change --input-file and --point_path to your own prompt and trajectory-point files. Please note that trajectories are drawn on a 256x256 canvas.

Replace $N_GPU with the number of GPUs you want to use.
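The bundled trajectory files (e.g. trajs/coaster.txt) define the expected format; the sketch below only illustrates generating a custom trajectory on the 256x256 canvas, under the assumption of one whitespace-separated "x y" point per line. Check a bundled file before relying on this format, and adjust the point count to your target frame count.

```python
import math

def circle_trajectory(n_points=49, cx=128.0, cy=128.0, r=80.0):
    """Sample n_points (x, y) points on a circle, clipped to the 256x256 canvas."""
    pts = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        x = min(max(cx + r * math.cos(t), 0.0), 255.0)
        y = min(max(cy + r * math.sin(t), 0.0), 255.0)
        pts.append((x, y))
    return pts

def write_trajectory(path, points):
    # One "x y" pair per line (assumed format, see trajs/ for real examples).
    with open(path, "w") as f:
        for x, y in points:
            f.write(f"{x:.2f} {y:.2f}\n")

write_trajectory("my_traj.txt", circle_trajectory())
```

The resulting file can then be passed via --point_path in the inference command above.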

Recommendations for Text Prompts

For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.

You can refer to the following resources for guidance:

🖥️ Gradio Demo

Usage:

cd sat
python app.py --load ckpts/tora/t2v

🧠 Training

Training requires around 60 GiB of GPU memory, tested on an NVIDIA A100.

Replace $N_GPU with the number of GPUs you want to use.

Text to Video

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"

🎯 Troubleshooting

1. ValueError: Non-consecutive added token...

Upgrade the transformers package to version 4.44.2. See this issue.

🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

Special thanks to the contributors of these libraries for their hard work and dedication!

📄 Our previous work

📚 Citation

@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
      title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2024},
      eprint={2407.21705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21705},
}