LAION-Face is the human face subset of LAION-400M; it consists of 50 million image-text pairs selected by running face detection to find images that contain faces. In addition to the 50 million full set (LAION-Face 50M), we also provide a 20 million subset (LAION-Face 20M) for fast evaluation.
LAION-Face was first used as the training set of FaRL, which provides powerful pre-trained transformer backbones for face analysis tasks.
For now, we only provide the list of image ids that contain human faces; you need to download the images yourself, following the instructions below. The face detection metadata is provided as well (see below).
pip install -r requirements.txt
We need pyarrow to read and write parquet files, and img2dataset to download the images.
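Based on the dependencies named above, the requirements presumably include at least the following (the requirements.txt shipped with the repo is authoritative; torch is also needed later to load the .pth files):

```text
pyarrow
img2dataset
torch
```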
We provide the list of sample ids on Hugging Face.
Download and convert the metadata with the following commands.
wget -l1 -r --no-parent https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/
mv the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ .
wget https://huggingface.co/datasets/FacePerceiver/laion-face/resolve/main/laion_face_ids.pth
python convert_parquet.py ./laion_face_ids.pth ./laion400m-meta ./laion_face_meta
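Conceptually, convert_parquet.py joins the LAION-400M metadata against the face id list, keeping only the rows whose sample id appears in laion_face_ids.pth. The sketch below shows that filtering step in plain Python over dicts; the real script operates on the parquet files with pyarrow, and the column names here follow the LAION-400M schema and are illustrative only:

```python
# Conceptual sketch of the join performed by convert_parquet.py:
# keep only the LAION-400M metadata rows whose SAMPLE_ID appears in
# the face id list. Column names (SAMPLE_ID/URL/TEXT) are illustrative.

def filter_face_rows(metadata_rows, face_ids):
    """Return the metadata rows whose SAMPLE_ID is in the face id set."""
    face_ids = set(face_ids)  # O(1) membership tests
    return [row for row in metadata_rows if row["SAMPLE_ID"] in face_ids]

rows = [
    {"SAMPLE_ID": 1, "URL": "http://a.example/1.jpg", "TEXT": "a face"},
    {"SAMPLE_ID": 2, "URL": "http://a.example/2.jpg", "TEXT": "a cat"},
    {"SAMPLE_ID": 3, "URL": "http://a.example/3.jpg", "TEXT": "two faces"},
]
kept = filter_face_rows(rows, face_ids=[1, 3])
print([row["SAMPLE_ID"] for row in kept])  # -> [1, 3]
```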
When the metadata is ready, you can start downloading the images.
bash download.sh ./laion_face_meta ./laion_face_data
Please be patient: this command may run for days, requires about 2 TB of disk space, and downloads the 50 million image-text pairs as 32 parts.
The LAION-Face 20M subset corresponds to these parts: 0,2,5,8,13,15,17,18,21,22,24,25,28
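If you only need the 20M subset, you can skip the remaining parts when downloading. A tiny helper, with the part numbering taken from the list above:

```python
# Part indices that make up the LAION-Face 20M subset (list above).
PARTS_20M = {0, 2, 5, 8, 13, 15, 17, 18, 21, 22, 24, 25, 28}

def in_20m_subset(part_index: int) -> bool:
    """True if a part (0..31) belongs to the LAION-Face 20M subset."""
    return part_index in PARTS_20M

# Which of the 32 parts to keep for the 20M subset:
print([p for p in range(32) if in_20m_subset(p)])
```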
Check out download.sh and img2dataset for more details and parameter settings.
We use batch-face to detect faces in the images; here we provide the face detection result for each sample.
To download the detection result, use the following command.
bash download_detection.sh ./detection_metadata
This will download 32 sample2detect.pth files into the detection_metadata directory, using about 30 GB of disk space; each file corresponds to one part from the previous section.
Each .pth file is a dict whose keys are int(SAMPLE_ID) and whose values are the face detection results.
To get the face detection result of single image, you can refer to the code snippet below.
import torch
part_index=0
SAMPLE_ID=int(SAMPLE_ID) # you can get it from the parquet file generated by img2dataset
sample2detect=torch.load(f"detection_metadata/sample2detect_{part_index}.pth") # each part has its own sample2detect .pth file; it's a dict
faces=sample2detect[SAMPLE_ID]
box, landmarks, score = faces[0] # face rectangle, the standard five points, confidence
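Building on the snippet above, a common next step is to keep only confident detections and clamp each face rectangle to the image bounds before cropping. A minimal sketch, assuming box is [x1, y1, x2, y2] in pixel coordinates (check the output format of your batch-face version):

```python
# Hedged sketch: filter detections by confidence and clamp the face
# rectangles to the image bounds. Assumes each entry in `faces` is
# (box, landmarks, score) with box = [x1, y1, x2, y2] in pixels.

def clamp_boxes(faces, width, height, min_score=0.5):
    """Return integer [x1, y1, x2, y2] boxes for confident detections."""
    kept = []
    for box, landmarks, score in faces:
        if score < min_score:
            continue  # drop low-confidence detections
        x1, y1, x2, y2 = box
        kept.append([
            max(0, int(x1)), max(0, int(y1)),
            min(width, int(x2)), min(height, int(y2)),
        ])
    return kept

# Synthetic example: one confident face spilling past a 100x100 image,
# and one low-confidence detection that gets dropped.
faces = [([10.5, -3.0, 120.9, 150.2], [[0, 0]] * 5, 0.99),
         ([5.0, 5.0, 50.0, 50.0], [[0, 0]] * 5, 0.10)]
print(clamp_boxes(faces, width=100, height=100))  # -> [[10, 0, 100, 100]]
```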
LAION-Face is the face subset of LAION-400M. We distribute the image id list (the .pth files) under the permissive Creative Commons CC-BY 4.0 license, which imposes no particular restriction. The metadata of the dataset comes from LAION-400M; please check LAION-400M for more details.
For help or issues concerning the data, feel free to submit a GitHub issue, or contact Yinglin Zheng.
If you find our work helpful, please consider citing:
@inproceedings{zheng2022general,
title={General facial representation learning in a visual-linguistic manner},
author={Zheng, Yinglin and Yang, Hao and Zhang, Ting and Bao, Jianmin and Chen, Dongdong and Huang, Yangyu and Yuan, Lu and Chen, Dong and Zeng, Ming and Wen, Fang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18697--18709},
year={2022}
}