[Project Page](https://yangjie-cv.github.io/X-Pose/) | [Paper](http://arxiv.org/abs/2310.08530) | [UniKPT Dataset](https://drive.google.com/file/d/1ukLPbTpTfrCQvRY2jY52CgRi-xqvyIsP/view) | [Video](https://github.com/IDEA-Research/UniPose)
[Jie Yang](https://yangjie-cv.github.io/)<sup>1,2</sup>, [Ailing Zeng](https://ailingzeng.site/)<sup>1</sup>, [Ruimao Zhang](http://www.zhangruimao.site/)<sup>2</sup>, [Lei Zhang](https://www.leizhang.org/)<sup>1</sup>
<sup>1</sup>[International Digital Economy Academy](https://www.idea.edu.cn/research/cvr.html)  <sup>2</sup>[The Chinese University of Hong Kong, Shenzhen](https://www.cuhk.edu.cn/en)
2024.07.12: X-Pose supports controllable animal face animation. See details here.
2024.07.02: X-Pose is accepted to ECCV 2024 (we renamed the model from UniPose to X-Pose to avoid confusion with similarly named previous works).
2024.02.14: We updated a file listing all 1,237 classes in the UniKPT dataset.
2023.11.28: We are excited to highlight X-Pose's 68-point face keypoint detection across arbitrary categories in this figure. The definition of the face keypoints follows this dataset.
2023.11.09: Thanks to OpenXLab, you can try a quick online demo. We look forward to your feedback!
2023.11.1: We release the inference code, demo, checkpoints, and the annotation of the UniKPT dataset.
2023.10.13: We release the arXiv version.
X-Pose has strong fine-grained localization and generalization abilities across image styles, categories, and poses.
• X-Pose is the first end-to-end prompt-based keypoint detection framework.
• It supports multi-modal prompts, including textual and visual prompts, to detect arbitrary keypoints (e.g., on articulated, rigid, and soft objects).
Clone this repo

```bash
git clone https://github.com/IDEA-Research/X-Pose.git
cd X-Pose
```
Install the needed packages

```bash
pip install -r requirements.txt
```
Compile the CUDA operators

```bash
cd models/UniPose/ops
python setup.py build install
# unit test (should see all checking is True)
python test.py
cd ../../..
```
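Building the operators requires a CUDA-enabled PyTorch installation. The snippet below is a minimal, purely illustrative environment check (not part of the repo) that can be run before compiling:

```python
# Illustrative environment check; not part of the X-Pose codebase.
import torch

print(torch.__version__)           # PyTorch version used for the build
print(torch.version.cuda)          # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())   # should be True, otherwise the custom ops cannot run
```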
• We have released the textual prompt-based branch for inference. Since visual prompts require a substantial amount of user input, we are currently exploring more user-friendly platforms to support this functionality.
• Since X-Pose has learned a strong structural prior, it is best to use one of the predefined skeletons as the keypoint textual prompt; they are listed in predefined_keypoints.py.
• If no keypoint prompt is provided, we try to match an appropriate skeleton based on the instance category given by the user. If that fails, we fall back to the animal skeleton, which covers the widest range of categories and testing requirements (see the sketch after this list).
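The sketch below illustrates this fallback logic. It assumes predefined_keypoints.py exposes one module-level dict per category (e.g., "person", "face", "animal") containing the keypoint names and skeleton; the exact names and structure should be verified against the file itself.

```python
# Sketch only: the layout of predefined_keypoints.py (one dict per category,
# an "animal" default entry) is an assumption; check the file for the actual names.
import predefined_keypoints

def select_keypoint_prompt(category, skeleton_name=None):
    """Pick the keypoint textual prompt for a given instance category."""
    skeletons = vars(predefined_keypoints)        # all module-level definitions
    if skeleton_name is not None and skeleton_name in skeletons:
        return skeletons[skeleton_name]           # explicit -k choice wins
    if category in skeletons:
        return skeletons[category]                # match by instance category
    return skeletons["animal"]                    # widest-coverage fallback
```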
Replace {GPU ID}, image_you_want_to_test.jpg, and "dir you want to save the output" with appropriate values in the following command:
```bash
CUDA_VISIBLE_DEVICES={GPU ID} python inference_on_a_image.py \
  -c config/UniPose_SwinT.py \
  -p weights/unipose_swint.pth \
  -i image_you_want_to_test.jpg \
  -o "dir you want to save the output" \
  -t "instance categories" \
  -k "keypoint_skeleton_text"
```

• -t: instance categories, e.g., "person", "face", "left hand", "horse", "car", "skirt", "table".
• -k: keypoint skeleton text (optional); if needed, select an entry from predefined_keypoints.py.
We also support inference with Gradio:

```bash
python app.py
```
| | name | backbone | Keypoint AP on COCO | Checkpoint | Config |
|---|---|---|---|---|---|
| 1 | X-Pose | Swin-T | 74.4 | Google Drive / OpenXLab | GitHub Link |
| 2 | X-Pose | Swin-L | 76.8 | Coming Soon | Coming Soon |
| Dataset | Keypoints | Classes | Images | Instances | Unified Images | Unified Instances |
|---|---|---|---|---|---|---|
| COCO | 17 | 1 | 58,945 | 156,165 | 58,945 | 156,165 |
| 300W-Face | 68 | 1 | 3,837 | 4,437 | 3,837 | 4,437 |
| OneHand10K | 21 | 1 | 11,703 | 11,289 | 2,000 | 2,000 |
| Human-Art | 17 | 1 | 50,000 | 123,131 | 50,000 | 123,131 |
| AP-10K | 17 | 54 | 10,015 | 13,028 | 10,015 | 13,028 |
| APT-36K | 17 | 30 | 36,000 | 53,006 | 36,000 | 53,006 |
| MacaquePose | 17 | 1 | 13,083 | 16,393 | 2,000 | 2,320 |
| Animal Kingdom | 23 | 850 | 33,099 | 33,099 | 33,099 | 33,099 |
| AnimalWeb | 9 | 332 | 22,451 | 21,921 | 22,451 | 21,921 |
| Vinegar Fly | 31 | 1 | 1,500 | 1,500 | 1,500 | 1,500 |
| Desert Locust | 34 | 1 | 700 | 700 | 700 | 700 |
| Keypoint-5 | 55/31 | 5 | 8,649 | 8,649 | 2,000 | 2,000 |
| MP-100 | 561/293 | 100 | 16,943 | 18,000 | 16,943 | 18,000 |
| UniKPT | 338 | 1,237 | - | - | 226,547 | 418,487 |
• UniKPT is a unified dataset built from 13 existing datasets and is intended for non-commercial research purposes only.
• All images in the UniKPT dataset originate from the datasets listed in the table above. To access the images, please download them from their original repositories.
• We provide annotations with precise textual descriptions of the keypoints for effective training. For convenience, the text annotations are also available at the provided link (see the sketch after this list for one way to inspect them).
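As a purely illustrative way to inspect the downloaded annotations, the sketch below assumes a COCO-style JSON layout (images, annotations, categories) with textual keypoint names per category; both the file name and the field layout are assumptions rather than the documented UniKPT format, so verify them against the actual download.

```python
# Illustrative only: the file name and JSON layout are assumptions,
# not the documented UniKPT format; verify against the downloaded file.
import json

with open("unikpt_annotations.json") as f:   # hypothetical file name
    data = json.load(f)

print(len(data["images"]), "images")
print(len(data["annotations"]), "keypoint-labelled instances")

# Each category is expected to carry the textual keypoint names used as prompts.
for cat in data["categories"][:3]:
    print(cat.get("name"), cat.get("keypoints"))
```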
If you find this repository useful for your work, please consider citing it as follows:
```bibtex
@inproceedings{xpose,
  title={X-Pose: Detecting Any Keypoints},
  author={Yang, Jie and Zeng, Ailing and Zhang, Ruimao and Zhang, Lei},
  booktitle={European Conference on Computer Vision},
  year={2024}
}

@inproceedings{yang2023neural,
  title={Neural Interactive Keypoint Detection},
  author={Yang, Jie and Zeng, Ailing and Li, Feng and Liu, Shilong and Zhang, Ruimao and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15122--15132},
  year={2023}
}

@inproceedings{yang2022explicit,
  title={Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation},
  author={Yang, Jie and Zeng, Ailing and Liu, Shilong and Li, Feng and Zhang, Ruimao and Zhang, Lei},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023}
}
```