
# VidGen-1M

**VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation**

[arXiv](https://arxiv.org/abs/2408.02629) | Project Page

## Introduction

We present VidGen-1M, a high-quality training dataset for text-to-video models. Built with a coarse-to-fine curation strategy, the dataset pairs high-quality videos with detailed captions and maintains strong temporal consistency. We trained a video generation model on this data and have open-sourced the model.

## News

## Contents

## Install

1. Clone this repository.
2. Create the environment and install the packages:

```bash
conda create -n vidgen python=3.10
conda activate vidgen

pip install torch==2.2.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tqdm einops omegaconf bigmodelvis deepspeed tensorboard timm==0.9.16 ninja opencv-python opencv-python-headless ftfy bs4 beartype colossalai accelerate ultralytics webdataset

pip install -U xformers --index-url https://download.pytorch.org/whl/cu118
```
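Since the wheels above target CUDA 11.8, it is worth confirming that the installed `torch` actually sees a GPU before moving on. This small check is a sketch and not part of the repository:

```python
def check_environment() -> str:
    """Report whether torch is installed and a CUDA device is visible."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        # Common causes: CPU-only wheel installed, or driver/CUDA mismatch.
        return f"torch {torch.__version__} installed, but CUDA is unavailable"
    return f"torch {torch.__version__} with {torch.cuda.device_count()} CUDA device(s)"

if __name__ == "__main__":
    print(check_environment())
```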


## VidGen-1M Dataset
To assist the community in researching and learning about video generation, we have released the high-quality [VidGen-1M](https://huggingface.co/datasets/Fudan-FUXI/VIDGEN-1M) video dataset.
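One way to fetch the dataset programmatically is with the `huggingface_hub` client; the repository id comes from the link above, while the function name and local directory are illustrative assumptions:

```python
def download_vidgen_dataset(local_dir: str = "./VidGen-1M") -> str:
    """Download a snapshot of the VidGen-1M dataset from the Hugging Face Hub.

    Requires `pip install huggingface_hub` and network access.
    Returns the local path of the downloaded snapshot.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Fudan-FUXI/VIDGEN-1M",
        repo_type="dataset",  # this repo is a dataset, not a model
        local_dir=local_dir,
    )
```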

## Model Weights
Please download the [model weights](https://huggingface.co/Fudan-FUXI/VIDGEN-v1.0) from Hugging Face.
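The weights can likewise be fetched with `huggingface_hub`; the repository id comes from the link above, and the helper name and target directory are assumptions for illustration:

```python
def download_vidgen_weights(local_dir: str = "./VIDGEN-v1.0") -> str:
    """Download the VIDGEN-v1.0 model weights from the Hugging Face Hub.

    Requires `pip install huggingface_hub` and network access.
    Returns the local path of the downloaded snapshot.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Fudan-FUXI/VIDGEN-v1.0",  # model repo, so no repo_type needed
        local_dir=local_dir,
    )
```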

## Sampling
You can run inference on a single GPU or on multiple GPUs; the script accepts various arguments.
```bash
bash scripts/sample_t2v.sh
```
## Citation

```bibtex
@article{tan2024vdgen-1m,
  title={VIDGEN-1M: A LARGE-SCALE DATASET FOR TEXT-TO-VIDEO GENERATION},
  author={Tan, Zhiyu and Yang, Xiaomeng and Qin, Luozheng and Li, Hao},
  journal={arXiv preprint arXiv:2408.02629},
  year={2024},
  institution={Fudan University and Shanghai Academy of AI for Science},
}
```