CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)

CelebV-Text: A Large-Scale Facial Text-Video Dataset
Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arxiv)

Text-driven generation models are currently booming in video editing thanks to their compelling results. However, face-centric text-to-video generation remains challenging because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which describes both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.

Updates

[11/08/2023]
- Audios (67k) can be downloaded now (see issue).
[20/06/2023]
- Videos can be downloaded now (see issue).
[28/03/2023]
- Paper is now released here!
[01/01/2023]
- Code of MMVID-interp is now released here.
- Pretrained models of benchmarks are released here.
- The data annotation file is now released here.
[28/12/2022]
- The codebase and project page are created.
- The download and processing tools for the dataset are released. Use them to construct your CelebV-Text!
[04/01/2024]
- Confusion about the annotation files is explained here.

Dataset Statistics
Agreement
Dataset Download
- Text Descriptions
- Video Download Pipeline
Benchmark
- Baselines
- Pretrained Models
Related Work
Citation
Acknowledgement

TODO

[x] Video download and processing tools.
[x] Text descriptions.
[x] Data annotations.
[x] Code of MMVID-interp.
[ ] Automatic text generation tool and templates.
[x] Pretrained models of benchmarks.

Dataset Statistics

https://user-images.githubusercontent.com/10545746/227458030-fbb48f66-db14-4c89-a001-4d7cdd29b248.mp4

The distributions of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
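As a quick back-of-the-envelope check, the figures quoted above imply an average clip length and a total number of text descriptions. The sketch below uses only those quoted numbers; since the 279-hour total is approximate, the derived average is approximate as well.

```python
# Figures quoted in the paragraph above.
num_clips = 70_000
total_hours = 279        # approximate total duration
texts_per_clip = 20

# Average clip length in seconds and total number of text descriptions.
avg_clip_seconds = total_hours * 3600 / num_clips
total_texts = num_clips * texts_per_clip

print(f"average clip length: about {avg_clip_seconds:.1f} s")
print(f"total text descriptions: {total_texts:,}")
```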

Visual ChatGPT Demo

This is a toy example of applying a text-to-face model together with ChatGPT. In this demo, we use MMVID, simply trained on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential for enabling visual GPT applications. In the future, more sophisticated methods are likely to produce better results.

https://user-images.githubusercontent.com/10545746/226870355-83c9c875-0e3b-439e-9df7-453d8e408807.mp4

Agreement

The CelebV-Text dataset is available for non-commercial research purposes only.
All videos of the CelebV-Text dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
You agree not to further copy, publish, or distribute any portion of the CelebV-Text dataset, except that copies of the dataset may be made for internal use at a single site within the same organization.

Dataset Download

(1) Text Descriptions & Metadata Annotation

Description	Link
general & detailed face attributes	Google Drive
emotion	Google Drive
action	Google Drive
light direction	Google Drive
light intensity	Google Drive
light color temperature	Google Drive
*metadata annotation	Google Drive

(2) Video Download Pipeline

Prepare the environment & Run script:

# prepare the environment
pip install youtube_dl
pip install opencv-python

# you can change the download folder in the code 
python download_and_process.py

JSON File Structure:

{
    "clips":
    {
        "0-5BrmyFsYM_0":  // clip 1 
        {
            "ytb_id": "0-5BrmyFsYM",                                        // youtube id
            "duration": {"start_sec": 0.0, "end_sec": 9.64},                // start and end times in the original video
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},  // bounding box
            "version": "v0.1"
        },

        "00-30GQl0TM_7":  // clip 2 
        {
            "ytb_id": "00-30GQl0TM",                                        // youtube id
            "duration": {"start_sec": 415.29, "end_sec": 420.88},           // start and end times in the original video
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
            "version": "v0.1"
        },
        "..."
        "..."

    }
}
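As a rough illustration of how the annotation file might be consumed, the sketch below parses a two-clip excerpt and derives each clip's duration and crop size. It assumes that every clip uses the `start_sec`/`end_sec` keys shown for clip 1, that durations are in seconds, and that `bbox` values are pixel coordinates in the original frame. The helper names (`clip_duration`, `crop_size`, `ffmpeg_args`) and the file names are hypothetical; they are not taken from `download_and_process.py`, and the official script may cut clips differently.

```python
import json

# Excerpt of the annotation file, with the explanatory comments removed
# so that it is valid JSON.
SAMPLE = """
{
    "clips": {
        "0-5BrmyFsYM_0": {
            "ytb_id": "0-5BrmyFsYM",
            "duration": {"start_sec": 0.0, "end_sec": 9.64},
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
            "version": "v0.1"
        },
        "00-30GQl0TM_7": {
            "ytb_id": "00-30GQl0TM",
            "duration": {"start_sec": 415.29, "end_sec": 420.88},
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956},
            "version": "v0.1"
        }
    }
}
"""


def clip_duration(clip):
    """Length of the clip in seconds (assumes start_sec/end_sec keys)."""
    d = clip["duration"]
    return round(d["end_sec"] - d["start_sec"], 2)


def crop_size(clip):
    """(width, height) of the face crop in pixels."""
    b = clip["bbox"]
    return b["right"] - b["left"], b["bottom"] - b["top"]


def ffmpeg_args(clip, src, dst):
    """Build an ffmpeg argument list that trims and crops one clip.

    Illustrative only: src and dst are hypothetical file names.
    """
    d, b = clip["duration"], clip["bbox"]
    w, h = crop_size(clip)
    return [
        "ffmpeg", "-i", src,
        "-ss", str(d["start_sec"]), "-to", str(d["end_sec"]),
        "-filter:v", f"crop={w}:{h}:{b['left']}:{b['top']}",
        dst,
    ]


clips = json.loads(SAMPLE)["clips"]
for clip_id, clip in clips.items():
    print(clip_id, clip_duration(clip), crop_size(clip))
```

The argument list returned by `ffmpeg_args` could be passed to `subprocess.run` once the source video has been downloaded; it is kept as a list here so the sketch runs without ffmpeg installed.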

Benchmark on Facial Text-to-Video Generation

(1) Baselines

In our paper, we trained the baselines using their original implementations:

MMVID
TFGAN

(2) Pretrained Models

Text Descriptions (MMVID)	Link
VQGAN	Google Drive
general & detailed face attributes	Google Drive
emotion	Google Drive
action	Google Drive
light direction	Google Drive
light intensity & color temperature	Google Drive
general face attributes + emotion + action + light direction	Google Drive

More Work That May Interest You

Several of our previous publications might be of interest to you.

Face Generation:
- (ECCV 2022) CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al. [Paper], [Project Page], [Dataset]
- (CVPR 2022) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al. [Paper], [Project Page], [Code]
Human Generation:
- (Tech. Report 2022) 3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al. [Paper], [Project Page], [Code]
- (ECCV 2022) StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al. [Paper], [Project Page], [Dataset]
- (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al. [Paper], [Project Page], [Code]

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}

Acknowledgement

CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).
