(CVPR 2023) CelebV-Text: A Large-Scale Facial Text-Video Dataset
https://celebv-text.github.io/

CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)

Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arxiv)

Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, challenges remain severe because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which can describe both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.

Dataset Statistics

https://user-images.githubusercontent.com/10545746/227458030-fbb48f66-db14-4c89-a001-4d7cdd29b248.mp4

The distributions of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
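
As a quick sanity check on the headline numbers above, the sketch below derives the average clip length and the total number of text descriptions. It uses only the figures quoted in this section; since the total duration is stated as "around 279 hours", the average is approximate. The function and variable names are illustrative and not part of the released code.

```python
# Figures quoted in this README; the total duration is approximate.
NUM_CLIPS = 70_000
TOTAL_HOURS = 279
TEXTS_PER_CLIP = 20


def dataset_summary(num_clips=NUM_CLIPS, total_hours=TOTAL_HOURS,
                    texts_per_clip=TEXTS_PER_CLIP):
    """Derive simple aggregate numbers from the headline statistics."""
    avg_clip_seconds = total_hours * 3600 / num_clips
    total_texts = num_clips * texts_per_clip
    return {"avg_clip_seconds": avg_clip_seconds, "total_texts": total_texts}


summary = dataset_summary()
print(f"average clip length: {summary['avg_clip_seconds']:.1f} s")
print(f"total text descriptions: {summary['total_texts']:,}")
```

Under these figures, a clip lasts roughly 14 seconds on average and the dataset holds 1.4 million text descriptions in total.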

Figures: video statistics, text statistics, and text-video relevance.

Visual ChatGPT Demo

This is a toy example of applying a text-to-face model with ChatGPT. In this demo, we use MMVID, simply trained on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential in enabling visual GPT applications. In the future, more sophisticated methods will likely lead to better results.

https://user-images.githubusercontent.com/10545746/226870355-83c9c875-0e3b-439e-9df7-453d8e408807.mp4

Agreement

Dataset Download

(1) Text Descriptions & Metadata Annotation

Description | Link
general & detailed face attributes | Google Drive
emotion | Google Drive
action | Google Drive
light direction | Google Drive
light intensity | Google Drive
light color temperature | Google Drive
*metadata annotation | Google Drive

(2) Video Download Pipeline

Prepare the environment & Run script:

# prepare the environment
pip install youtube_dl
pip install opencv-python

# you can change the download folder in the code
python download_and_process.py

JSON File Structure:
{
    "clips":
    {
        "0-5BrmyFsYM_0":  // clip 1 
        {
            "ytb_id": "0-5BrmyFsYM",                                        // youtube id
            "duration": {"start_sec": 0.0, "end_sec": 9.64},                // start and end times in the original video
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},  // bounding box
            "version": "v0.1"
        },

        "00-30GQl0TM_7":  // clip 2 
        {
            "ytb_id": "00-30GQl0TM",                                        // youtube id
            "duration": {"start_sec": 415.29, "end_sec": 420.88},           // start and end times in the original video
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
            "version": "v0.1"
        },
        "..."
        "..."

    }
}
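
The structure above can be read with Python's standard `json` module. The following is a minimal sketch, not the logic of `download_and_process.py`: it assumes the real metadata file is plain JSON without the `//` comments, that every record stores its times as `start_sec` and `end_sec` as in the first clip, and that `bbox` is given in pixel coordinates. The helper names `iter_clips` and `frame_range` and the 30 fps value are hypothetical.

```python
import json

# One record copied from the structure above, without the // comments
# (plain JSON does not allow comments).
SAMPLE = """
{
  "clips": {
    "0-5BrmyFsYM_0": {
      "ytb_id": "0-5BrmyFsYM",
      "duration": {"start_sec": 0.0, "end_sec": 9.64},
      "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
      "version": "v0.1"
    }
  }
}
"""


def iter_clips(metadata):
    """Yield one flat record per clip with derived length and crop size."""
    for clip_id, info in metadata["clips"].items():
        dur, box = info["duration"], info["bbox"]
        yield {
            "clip_id": clip_id,
            # Standard YouTube watch URL built from the video id.
            "url": "https://www.youtube.com/watch?v=" + info["ytb_id"],
            "length_sec": dur["end_sec"] - dur["start_sec"],
            "width": box["right"] - box["left"],    # assumes pixel units
            "height": box["bottom"] - box["top"],   # assumes pixel units
        }


def frame_range(duration, fps):
    """Convert start/end seconds to (first, last) frame indices at a given fps."""
    return round(duration["start_sec"] * fps), round(duration["end_sec"] * fps)


clips = list(iter_clips(json.loads(SAMPLE)))
for clip in clips:
    print(clip["clip_id"], f"{clip['length_sec']:.2f} s",
          f"{clip['width']}x{clip['height']}")

# Hypothetical frame rate; the real value depends on the downloaded video.
first_frame, last_frame = frame_range({"start_sec": 0.0, "end_sec": 9.64}, fps=30.0)
```

To read the real annotation file instead of the sample string, replace `json.loads(SAMPLE)` with `json.load(f)` on the opened file.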

Benchmark on Facial Text-to-Video Generation

(1) Baselines

To train the baselines for our paper, we used their original implementations.

(2) Pretrained Models

Text Descriptions (MMVID) | Link
VQGAN | Google Drive
general & detailed face attributes | Google Drive
emotion | Google Drive
action | Google Drive
light direction | Google Drive
light intensity & color temperature | Google Drive
general face attributes + emotion + action + light direction | Google Drive

More Work May Interest You

Several of our previous publications might be of interest to you.

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}

Acknowledgement

CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).