
Open-Sora Plan


v1.0.0 badge: [![Twitter](https://img.shields.io/badge/-Twitter@LinBin46984-black?logo=twitter&logoColor=1D9BF0)](https://x.com/LinBin46984/status/1763476690385424554?s=20)
[![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.0.0) [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/fffiloni/Open-Sora-Plan-v1-0-0) [![Replicate demo and cloud API](https://replicate.com/camenduru/open-sora-plan-512x512/badge)](https://replicate.com/camenduru/open-sora-plan-512x512) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/Open-Sora-Plan-jupyter/blob/main/Open_Sora_Plan_jupyter.ipynb)

We are thrilled to present Open-Sora-Plan v1.1.0, which significantly enhances video generation quality and text control capabilities. See our report for details. Note that the previews on GitHub are compressed GIFs, which lose some quality.

Thanks to the HUAWEI Ascend team for supporting us. In the second stage, we used Huawei Ascend compute for training; training and inference in this stage were fully supported by Huawei. Models trained on Huawei Ascend can also be loaded onto GPUs and generate videos of the same quality.

็›ฎๅ‰ๅทฒ็ปๆ”ฏๆŒไฝฟ็”จๅ›ฝไบงAI่ฎก็ฎ—็ณป็ปŸ(ๅŽไธบๆ˜‡่…พ๏ผŒๆœŸๅพ…ๆ›ดๅคšๅ›ฝไบง็ฎ—ๅŠ›่Šฏ็‰‡)่ฟ›่กŒๅฎŒๆ•ด็š„่ฎญ็ปƒๅ’ŒๆŽจ็†ใ€‚ๅœจ้กน็›ฎ็ฌฌไบŒ้˜ถๆฎต๏ผŒๆ‰€ๆœ‰่ฎญ็ปƒๅ’ŒๆŽจ็†ไปปๅŠกๅฎŒๅ…จ็”ฑๅŽไธบๆ˜‡่…พ่ฎก็ฎ—็ณป็ปŸๆ”ฏๆŒใ€‚ๆญคๅค–๏ผŒๅŸบไบŽๅŽไธบๆ˜‡่…พ็š„512ๅก้›†็พค่ฎญ็ปƒๅ‡บ็š„ๆจกๅž‹๏ผŒไนŸๅฏไปฅๆ— ็ผๅœฐๅœจGPUไธŠ่ฟ่กŒ๏ผŒๅนถไฟๆŒ็›ธๅŒ็š„่ง†้ข‘่ดจ้‡ใ€‚่ฏฆ็ป†ไฟกๆฏ่ฏทๅ‚่€ƒๆˆ‘ไปฌ็š„hw branch.

221×512×512 Text-to-Video Generation

- 3D animation of a small, round, fluffy creature with big, expressive eyes explores ...
- A single drop of liquid metal falls from a floating orb, landing on a mirror-like ...
- The video presents an abstract composition centered around a hexagonal shape adorned ...
- A drone camera circles around a beautiful historic church built on a rocky outcropping ...
- Aerial view of Santorini during the blue hour, showcasing the stunning architecture ...
- An aerial shot of a lighthouse standing tall on a rocky cliff, its beacon cutting ...
- A snowy forest landscape with a dirt road running through it. The road is flanked by ...
- Drone shot along the Hawaii jungle coastline, sunny day. Kayaks in the water.
- The camera rotates around a large stack of vintage televisions all showing different ...

65×512×512 Text-to-Video Generation

- In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two ...
- A Shiba Inu dog wearing a beret and black turtleneck.
- A painting of a boat on water comes to life, with waves crashing and the boat becoming ...
- A person clad in a space suit with a helmet and equipped with a chest light and arm ...
- 3D animation of a small, round, fluffy creature with big, expressive eyes explores a ...
- In a studio, there is a painting depicting a ship sailing through the rough sea.
- A robot dog trots down a deserted alley at night, its metallic paws clinking softly ...
- A lone surfer rides a massive wave, skillfully maneuvering through the surf. The water ...
- A solitary cheetah sprints across the savannah, its powerful muscles propelling it ...

65×512×512 Video Editing

Generated
Edited

512×512 Text-to-Image Generation

📰 News

[2024.05.27] 🚀🚀🚀 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.

[2024.04.09] 🚀 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos. The training dataset (continuously updated) is available here: Open-Sora-Dataset.

[2024.04.07] 🔥🔥🔥 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.

[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed videos in the demonstration below. The text-to-video model is on the way.

View more

**[2024.03.10]** 🚀🚀🚀 This repo supports training with a latent size of 225×90×90 (t×h×w), which means we are able to **train 1 minute of 1080P video at 30FPS** (with 2× interpolated frames and 2× super resolution) under class-condition.

**[2024.03.08]** We support the training code of text condition with 16 frames of 512x512. The code is mainly borrowed from [Latte](https://github.com/Vchitect/Latte).

**[2024.03.07]** We support training with 128 frames (about 13 seconds at sample rate 3) of 256x256, or 64 frames (about 6 seconds) of 512x512.

**[2024.03.05]** See our latest [todo](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#todo); pull requests are welcome.

**[2024.03.04]** We re-organized and modularized our code to make it easy to [contribute](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#how-to-contribute-to-the-open-sora-plan-community) to the project; to contribute, please see the [Repo structure](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#repo-structure).

**[2024.03.03]** We opened some [discussions](https://github.com/PKU-YuanGroup/Open-Sora-Plan/discussions) to clarify several issues.

**[2024.03.01]** Training code is available now! Learn more on our [project page](https://pku-yuangroup.github.io/Open-Sora-Plan/). Please feel free to watch 👀 this repository for the latest updates.

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). We hope the open-source community can contribute to this project. Pull requests are welcome!!!

This project hopes to reproduce Sora through the power of the open-source community. It is jointly launched by the PKU-Rabbitpre AIGC Joint Lab. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!!!

Project stages:

✊ Todo

#### Setup the codebase and train an unconditional model on landscape dataset

- [x] Fix typos & update readme. 🤝 Thanks to [@mio2333](https://github.com/mio2333), [@CreamyLong](https://github.com/CreamyLong), [@chg0901](https://github.com/chg0901), [@Nyx-177](https://github.com/Nyx-177), [@HowardLi1984](https://github.com/HowardLi1984), [@sennnnn](https://github.com/sennnnn), [@Jason-fan20](https://github.com/Jason-fan20)
- [x] Set up the environment. 🤝 Thanks to [@nameless1117](https://github.com/nameless1117)
- [ ] Add a Dockerfile. ⌛ [WIP] 🤝 Thanks to [@Mon-ius](https://github.com/Mon-ius), [@SimonLeeGit](https://github.com/SimonLeeGit)
- [ ] Enable type hints for functions. 🤝 Thanks to [@RuslanPeresy](https://github.com/RuslanPeresy), 🙏 **[Need your contribution]**
- [x] Resume from checkpoint.
- [x] Add the Video-VQVAE model, which is borrowed from [VideoGPT](https://github.com/wilson1yan/VideoGPT).
- [x] Support training with variable aspect ratios, resolutions, and durations on [DiT](https://github.com/facebookresearch/DiT).
- [x] Support dynamic mask input inspired by [FiT](https://github.com/whlzy/FiT).
- [x] Add class-conditioning on embeddings.
- [x] Incorporate [Latte](https://github.com/Vchitect/Latte) as the main codebase.
- [x] Add the VAE model, which is borrowed from [Stable Diffusion](https://github.com/CompVis/latent-diffusion).
- [x] Joint dynamic mask input with VAE.
- [ ] Add VQVAE from [VQGAN](https://github.com/CompVis/taming-transformers). 🙏 **[Need your contribution]**
- [ ] Make the codebase ready for cluster training; add SLURM scripts. 🙏 **[Need your contribution]**
- [x] Refactor VideoGPT. 🤝 Thanks to [@qqingzheng](https://github.com/qqingzheng), [@luo3300612](https://github.com/luo3300612), [@sennnnn](https://github.com/sennnnn)
- [x] Add a sampling script.
- [ ] Add a DDP sampling script. ⌛ [WIP]
- [x] Use accelerate on multi-node. 🤝 Thanks to [@sysuyy](https://github.com/sysuyy)
- [x] Incorporate [SiT](https://github.com/willisma/SiT). 🤝 Thanks to [@khan-yin](https://github.com/khan-yin)
- [x] Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to [@rain305f](https://github.com/rain305f)

#### Train models that boost resolution and duration

- [x] Add [PI](https://arxiv.org/abs/2306.15595) to support out-of-domain sizes. 🤝 Thanks to [@jpthu17](https://github.com/jpthu17)
- [x] Add 2D RoPE to improve generalization ability, as in [FiT](https://github.com/whlzy/FiT). 🤝 Thanks to [@jpthu17](https://github.com/jpthu17)
- [x] Compress KV according to [PixArt-sigma](https://pixart-alpha.github.io/PixArt-sigma-project).
- [x] Support DeepSpeed for VideoGPT training. 🤝 Thanks to [@sennnnn](https://github.com/sennnnn)
- [x] Train a **low-dimension** Video-AE, whether VAE or VQVAE.
- [x] Extract offline features.
- [x] Train with offline features.
- [x] Add a frame-interpolation model. 🤝 Thanks to [@yunyangge](https://github.com/yunyangge)
- [x] Add a super-resolution model. 🤝 Thanks to [@Linzy19](https://github.com/Linzy19)
- [x] Add accelerate to automatically manage training.
- [x] Joint training with images.
- [ ] Implement the [MaskDiT](https://github.com/Anima-Lab/MaskDiT) technique for fast training. 🙏 **[Need your contribution]**
- [ ] Incorporate [NaViT](https://arxiv.org/abs/2307.06304). 🙏 **[Need your contribution]**
- [ ] Add [FreeNoise](https://github.com/arthur-qiu/FreeNoise-LaVie) support for training-free longer video generation. 🙏 **[Need your contribution]**

#### Conduct text2video experiments on landscape dataset

- [x] Load pretrained weights from [Latte](https://github.com/Vchitect/Latte).
- [ ] Implement [PeRFlow](https://github.com/magic-research/piecewise-rectified-flow) to improve the sampling process. 🙏 **[Need your contribution]**
- [x] Finish data loading and pre-processing utils.
- [x] Add T5 support.
- [x] Add CLIP support. 🤝 Thanks to [@Ytimed2020](https://github.com/Ytimed2020)
- [x] Add a text2image training script.
- [ ] Add a prompt captioner.
  - [ ] Collect training data.
    - [ ] Need video-text pairs with captions. 🙏 **[Need your contribution]**
    - [ ] Extract multi-frame descriptions with large image-language models. 🤝 Thanks to [@HowardLi1984](https://github.com/HowardLi1984)
    - [ ] Extract video descriptions with large video-language models. 🙏 **[Need your contribution]**
    - [ ] Integrate captions into a dense caption with a large language model, such as GPT-4. 🤝 Thanks to [@HowardLi1984](https://github.com/HowardLi1984)
  - [ ] Train a captioner to refine captions. 🚀 **[Require more computation]**

#### Train the 1080p model on video2text dataset

- [ ] Looking for a suitable dataset; recommendations and discussion are welcome. 🙏 **[Need your contribution]**
- [ ] Add synthetic video created by game engines or 3D representations. 🙏 **[Need your contribution]**
- [x] Finish data loading and pre-processing utils.
- [x] Support memory-friendly training.
  - [x] Add FlashAttention-2 from PyTorch.
  - [x] Add xformers. 🤝 Thanks to [@jialin-zhao](https://github.com/jialin-zhao)
  - [x] Support mixed-precision training.
  - [x] Add gradient checkpointing.
  - [x] Support ReBased and Ring attention. 🤝 Thanks to [@kabachuha](https://github.com/kabachuha)
  - [x] Train using the DeepSpeed engine. 🤝 Thanks to [@sennnnn](https://github.com/sennnnn)
- [ ] Train with a text condition. Here we could conduct different experiments: 🚀 **[Require more computation]**
  - [x] Train with T5 conditioning.
  - [ ] Train with CLIP conditioning.
  - [ ] Train with CLIP + T5 conditioning (probably costly during training and experiments).
- [ ] Support Chinese. ⌛ [WIP]

#### Control model with more conditions

- [ ] Incorporate [ControlNet](https://github.com/lllyasviel/ControlNet). ⌛ [WIP] 🙏 **[Need your contribution]**
- [ ] Incorporate [ReVideo](https://github.com/MC-E/ReVideo). ⌛ [WIP]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

๐Ÿ› ๏ธ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder
    git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
    cd Open-Sora-Plan
  2. Install required packages
    conda create -n opensora python=3.8 -y
    conda activate opensora
    pip install -e .
  3. Install additional packages for training
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
  4. Install optional requirements such as static type checking:
    pip install -e '.[dev]'
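To verify the environment before moving on, a quick import check helps catch a broken install early. A minimal sketch, assuming the `opensora` package installed above imports cleanly:

```python
import torch

import opensora  # installed above via `pip install -e .`

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# flash-attn is only required for the training extras; guard the import.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (fine for inference-only use)")
```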

๐Ÿ—๏ธ Usage

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo via the following commands. We also provide an online demo in Hugging Face Spaces.

For v1.0.0, we also provide an [online demo](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.0.0) [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.0.0) and [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/fffiloni/Open-Sora-Plan-v1-0-0) in Hugging Face Spaces. 🤝 Enjoy the [![Replicate demo and cloud API](https://replicate.com/camenduru/open-sora-plan-512x512/badge)](https://replicate.com/camenduru/open-sora-plan-512x512) and [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/Open-Sora-Plan-jupyter/blob/main/Open_Sora_Plan_jupyter.ipynb), created by [@camenduru](https://github.com/camenduru), who generously supports our research!

For the 65-frame model:

python -m opensora.serve.gradio_web_server --version 65x512x512

For the 221-frame model:

python -m opensora.serve.gradio_web_server --version 221x512x512

CLI Inference

sh scripts/text_condition/sample_video.sh

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

CausalVideoVAE

Reconstructing

Example:

python examples/rec_imvi_vae.py --video_path test_video.mp4 --rec_path output_video.mp4 --fps 24 --resolution 512 --crop_size 512 --num_frames 128 --sample_rate 1 --ae CausalVAEModel_4x8x8 --model_path pretrained_488_release --enable_tiling --enable_time_chunk

Parameter explanation (in brief; see `examples/rec_imvi_vae.py` for the authoritative list):

- `--video_path` / `--rec_path`: input video and reconstructed output file.
- `--fps`, `--resolution`, `--crop_size`, `--num_frames`, `--sample_rate`: how the input clip is sampled and sized before encoding.
- `--ae` / `--model_path`: the CausalVAE variant and its pretrained weights.
- `--enable_tiling` / `--enable_time_chunk`: tile spatially and chunk temporally to reduce peak memory on long, high-resolution videos.
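To reconstruct several clips in a row, a thin wrapper over the documented script avoids depending on internal APIs. A minimal sketch (the input/output paths are hypothetical; the flags mirror the example command above):

```python
import subprocess

# Hypothetical (input, output) pairs to reconstruct one after another.
jobs = [
    ("test_video.mp4", "output_video.mp4"),
]

for src, dst in jobs:
    subprocess.run(
        [
            "python", "examples/rec_imvi_vae.py",
            "--video_path", src, "--rec_path", dst,
            "--fps", "24", "--resolution", "512", "--crop_size", "512",
            "--num_frames", "128", "--sample_rate", "1",
            "--ae", "CausalVAEModel_4x8x8",
            "--model_path", "pretrained_488_release",
            "--enable_tiling", "--enable_time_chunk",
        ],
        check=True,  # stop at the first failed reconstruction
    )
```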

Training and Eval

Please refer to the document CausalVideoVAE.

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/text_condition/train_videoae_65x512x512.sh
sh scripts/text_condition/train_videoae_221x512x512.sh
sh scripts/text_condition/train_videoae_513x512x512.sh
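The three scripts differ only in frame count. As a rough sizing aid, the `CausalVAEModel_4x8x8` name suggests 4× temporal and 8×8 spatial compression with the first frame kept causally, which would explain the 4k+1 frame counts (this is an inference from the model name and the frame counts 65, 221, and 513, not a documented guarantee). Under that assumption, the latent grids work out as:

```python
def latent_shape(num_frames, height, width, t=4, s=8):
    """Latent grid implied by a t x (s x s) causal compression.

    Assumes the causal scheme keeps the first frame and groups the
    remaining frames in blocks of t, i.e. T -> 1 + (T - 1) // t.
    """
    return 1 + (num_frames - 1) // t, height // s, width // s

for frames in (65, 221, 513):
    print(f"{frames}x512x512 -> {latent_shape(frames, 512, 512)}")
# 65x512x512  -> (17, 64, 64)
# 221x512x512 -> (56, 64, 64)
# 513x512x512 -> (129, 64, 64)
```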

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines

๐Ÿ‘ Acknowledgement

🔒 License

โœ๏ธ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}

Latest DOI: [10.5281/zenodo.10948109](https://doi.org/10.5281/zenodo.10948109)

๐Ÿค Community contributors