CelebV-Text: A Large-Scale Facial Text-Video Dataset
Jianhui Yu*,
Hao Zhu*,
Liming Jiang,
Chen Change Loy,
Weidong Cai,
and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arXiv)
Text-driven generation models are currently booming in video editing thanks to their compelling results. However, face-centric text-to-video generation remains challenging because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which can describe both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.
The distribution of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
This is a toy example of applying a text-to-face model together with ChatGPT. In this demo, we use MMVID, simply trained on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential in enabling visual GPT applications. In the future, more sophisticated methods will likely lead to better results.
Description | Link |
---|---|
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity | Google Drive |
light color temperature | Google Drive |
*metadata annotation | Google Drive |
Prepare the environment & Run script:
# prepare the environment
pip install youtube_dl
pip install opencv-python
# you can change the download folder in the code
python download_and_process.py
{
"clips":
{
"0-5BrmyFsYM_0": // clip 1
{
"ytb_id": "0-5BrmyFsYM", // youtube id
"duration": {"start_sec": 0.0, "end_sec": 9.64}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872}, // bounding box
"version": "v0.1"
},
"00-30GQl0TM_7": // clip 2
{
"ytb_id": "00-30GQl0TM", // youtube id
"duration": {"start_sec": 415.29, "end_sec": 420.88}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
"version": "v0.1"
},
"..."
"..."
}
}
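The annotation layout above can be read with a few lines of Python. The following is a minimal sketch, not the project's own loader: it embeds a copy of the first clip entry with the `//` comments removed (standard JSON does not allow comments), and the helper name `summarize_clip` is hypothetical. It also assumes that `bbox` values are pixel coordinates in the original video frame, which the annotation itself does not state.

```python
import json

# Copy of the first clip entry shown above, with the // comments removed
# so that it parses as standard JSON.
SAMPLE = """
{
  "clips": {
    "0-5BrmyFsYM_0": {
      "ytb_id": "0-5BrmyFsYM",
      "duration": {"start_sec": 0.0, "end_sec": 9.64},
      "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
      "version": "v0.1"
    }
  }
}
"""


def summarize_clip(clip_id, clip):
    """Hypothetical helper: derive the source URL, clip length, and crop size."""
    duration = clip["duration"]
    bbox = clip["bbox"]  # assumed to be pixel coordinates in the source frame
    return {
        "clip_id": clip_id,
        "url": "https://www.youtube.com/watch?v=" + clip["ytb_id"],
        "length_sec": round(duration["end_sec"] - duration["start_sec"], 2),
        "crop_width": bbox["right"] - bbox["left"],
        "crop_height": bbox["bottom"] - bbox["top"],
    }


metadata = json.loads(SAMPLE)
summaries = [summarize_clip(cid, clip) for cid, clip in metadata["clips"].items()]
for summary in summaries:
    print(summary)
```

For the sample entry this gives a clip of 9.64 seconds and a 1023 x 937 crop region. To use it on the real annotation file, replace `SAMPLE` with the file contents loaded from disk.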
To train the baselines in our paper, we used their original implementations:
Text Descriptions (MMVID) | Link |
---|---|
VQGAN | Google Drive |
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity & color temperature | Google Drive |
general face attributes + emotion + action + light direction | Google Drive |
Several of our previous publications might be of interest to you.
Face Generation:
Human Generation:
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yu2022celebvtext,
title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
booktitle={CVPR},
year={2023}
}
CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).