OpenGVLab/Hulk

Hulk: A Universal Knowledge Translator for Human-centric Tasks

[Yizhou Wang](https://scholar.google.com/citations?user=CQGaGMAAAAAJ&hl=zh-CN&authuser=1)¹*, [Yixuan Wu](https://scholar.google.com/citations?user=zjAxJcwAAAAJ&hl=en&oi=ao)¹*,², [Shixiang Tang](https://github.com/tangshixiang)¹✉, Weizhen He²,³, [Xun Guo](https://github.com/Space-Xun)¹,⁴, [Feng Zhu](https://zhufengx.github.io/)³, [Lei Bai](http://leibai.site/)¹, [Rui Zhao](http://zhaorui.xyz/)³, Jian Wu², [Tong He](http://tonghe90.github.io/)¹, [Wanli Ouyang](https://wlouyang.github.io/)¹

¹[Shanghai AI Lab](https://www.shlab.org.cn/), ²[ZJU](https://www.zju.edu.cn/), ³[SenseTime](https://www.sensetime.com), ⁴[USTC](https://www.ustc.edu.cn/)

[ArXiv](https://arxiv.org/abs/2312.01697) | [Project Page](https://humancentricmodels.github.io/Hulk/)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pose-estimation-on-aic)](https://paperswithcode.com/sota/pose-estimation-on-aic?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/human-part-segmentation-on-cihp)](https://paperswithcode.com/sota/human-part-segmentation-on-cihp?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/skeleton-based-action-recognition-on-ntu-rgbd)](https://paperswithcode.com/sota/skeleton-based-action-recognition-on-ntu-rgbd?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/semantic-segmentation-on-lip-val)](https://paperswithcode.com/sota/semantic-segmentation-on-lip-val?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/human-part-segmentation-on-human3-6m)](https://paperswithcode.com/sota/human-part-segmentation-on-human3-6m?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pedestrian-attribute-recognition-on-rapv2)](https://paperswithcode.com/sota/pedestrian-attribute-recognition-on-rapv2?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pedestrian-attribute-recognition-on-pa-100k)](https://paperswithcode.com/sota/pedestrian-attribute-recognition-on-pa-100k?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pose-estimation-on-coco)](https://paperswithcode.com/sota/pose-estimation-on-coco?p=hulk-a-universal-knowledge-translator-for) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/object-detection-on-crowdhuman-full-body)](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=hulk-a-universal-knowledge-translator-for)

Welcome to Hulk! Hulk is a multimodal human-centric generalist model that addresses 2D vision, 3D vision, skeleton-based, and vision-language human-centric tasks. Many existing human-centric foundation models do not cover 3D and vision-language tasks and require task-specific finetuning. Hulk instead condenses the various task-specific heads into two general heads: one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. Unifying these tasks lets Hulk treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. For more details, please see our paper, [Hulk: A Universal Knowledge Translator for Human-centric Tasks](https://arxiv.org/abs/2312.01697).
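As a toy illustration of the two-head idea described above (not the Hulk implementation), the sketch below routes a task's output representation to one of two shared heads. The names `HeadKind`, `OUTPUT_TO_HEAD`, and `select_head` are hypothetical; only the two example representations, languages and location coordinates, come from the text.

```python
from enum import Enum


class HeadKind(Enum):
    """The two general output heads named in the description above."""
    DISCRETE = "discrete"      # discrete representations, e.g. languages
    CONTINUOUS = "continuous"  # continuous representations, e.g. location coordinates


# Hypothetical routing table: the keys are illustrative labels, not identifiers
# taken from the Hulk codebase.
OUTPUT_TO_HEAD = {
    "language": HeadKind.DISCRETE,
    "coordinates": HeadKind.CONTINUOUS,
}


def select_head(output_representation: str) -> HeadKind:
    """Pick the shared head for a task from the kind of output it produces."""
    try:
        return OUTPUT_TO_HEAD[output_representation]
    except KeyError:
        raise ValueError(
            f"unknown output representation: {output_representation!r}"
        ) from None


if __name__ == "__main__":
    for name in ("language", "coordinates"):
        print(name, "->", select_head(name).value)
```

The point of the sketch is only that the head depends on the kind of output, not on the individual task, so adding a task does not add a head.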

News

Apr. 2024 A pretrained Hulk is released on 🤗 Hugging Face Models!
Apr. 2024 Project page with demos is released at [Hulk](https://humancentricmodels.github.io/Hulk/).
Mar. 2024 Training and inference code are released!
Dec. 2023 Hulk is released on ArXiv!

Installation

This codebase was developed with Python 3.9, PyTorch 2.0.0, CUDA 11.8, and torchvision 0.15.0. We recommend using the same versions to avoid potential issues.
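If it helps, the installed environment can be compared against the versions listed above with a small script. This is a minimal sketch using only the standard library: the helper names are our own, it compares major and minor versions only (ignoring patch releases and build tags such as `+cu118` is our choice, not a project requirement), and it does not check the CUDA version.

```python
import sys
from importlib import metadata

# Versions stated in the Installation section; the layout of this table is our own.
EXPECTED_PYTHON = (3, 9)
EXPECTED_PACKAGES = {"torch": (2, 0), "torchvision": (0, 15)}


def major_minor(version: str) -> tuple:
    """Reduce '2.0.0+cu118' to (2, 0): drop the local build tag, keep two fields."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split(".")[:2])


def check_environment() -> list:
    """Return one message per mismatch; an empty list means everything matches."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        problems.append(
            f"python: expected 3.9, found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    for package, wanted in EXPECTED_PACKAGES.items():
        try:
            found = major_minor(metadata.version(package))
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if found != wanted:
            problems.append(
                f"{package}: expected {wanted[0]}.{wanted[1]}.x, "
                f"found {found[0]}.{found[1]}"
            )
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the listed versions"]:
        print(line)
```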

pip install -r requirements.txt

Also, download bert-base-uncased from Hugging Face and put it under experiments/release/.
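One possible way to fetch the model is the `huggingface_hub` package, sketched below. Two assumptions are ours and not confirmed by this README: that `huggingface_hub` is installed, and that the files should live in a sub-folder named `bert-base-uncased` inside experiments/release/. The helper names are hypothetical.

```python
from pathlib import Path

BERT_REPO_ID = "bert-base-uncased"


def bert_target_dir(repo_root: str = ".") -> Path:
    """Folder the model is downloaded to (assumed to be named after the model)."""
    return Path(repo_root) / "experiments" / "release" / BERT_REPO_ID


def download_bert(repo_root: str = ".") -> Path:
    """Download bert-base-uncased from the Hugging Face Hub into the repo."""
    # Imported lazily so bert_target_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = bert_target_dir(repo_root)
    snapshot_download(repo_id=BERT_REPO_ID, local_dir=target)
    return target
```

Calling `download_bert()` from the repository root needs network access; adjust the target folder if the configs expect a different layout.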

Datasets

Please refer to the datasets for more details.

Training

Download the pre-trained MAE weights from here and put them under core/models/backbones/pretrain_weights/.

We use 10 nodes (80 A100 GPUs) for training with the following command:

cd experiments/release
sh train.sh 80 Hulk_vit-B
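Before launching a multi-node job, it can be worth confirming that the inputs mentioned above are in place. The following is a minimal sketch with a hypothetical helper name. The MAE weights folder is the path given in this section, while the `bert-base-uncased` folder name under experiments/release/ is an assumption.

```python
from pathlib import Path


def missing_training_inputs(repo_root: str = ".") -> list:
    """Return a message for each expected input that is not in place."""
    root = Path(repo_root)
    problems = []

    # Path from the Training section; it should contain at least one file.
    mae_dir = root / "core" / "models" / "backbones" / "pretrain_weights"
    if not mae_dir.is_dir() or not any(mae_dir.iterdir()):
        problems.append(f"no MAE weights found in {mae_dir}")

    # Assumption: bert-base-uncased sits in a folder of the same name.
    bert_dir = root / "experiments" / "release" / "bert-base-uncased"
    if not bert_dir.is_dir():
        problems.append(f"missing {bert_dir}")

    return problems


if __name__ == "__main__":
    for line in missing_training_inputs() or ["all training inputs found"]:
        print(line)
```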

Evaluation

A pretrained Hulk is available at 🤗 Hugging Face Models. Create the checkpoint folder with mkdir -p experiments/release/checkpoints/Hulk_vit-B, download the checkpoint into it, then use the following command to evaluate the model on the test set.

cd experiments/release
sh batch_eval.sh 1 Hulk_vit-B

Model Performance

We use a plain ViT as the backbone and develop four modality-specific tokenizers and de-tokenizers to cover 2D vision, 3D vision, skeleton-based, and vision-language human-centric tasks. Hulk achieves state-of-the-art results on various human-centric tasks.

Direct Evaluation

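| Task | Dataset | Metric | Hulk (ViT-B) | Hulk (ViT-L) |
|---|---|---|---|---|
| pedestrian detection | CrowdHuman | mAP | 90.7 | 92.2 |
| pedestrian detection | CrowdHuman | MR⁻² | 43.8 | 40.1 |
| pedestrian detection | CrowdHuman | JI | 84.0 | 85.8 |
| 2D pose | COCO | AP | 77.0 | 78.3 |
| 2D pose | AIC | AP | 34.5 | 36.3 |
| skeleton-based action | NTU60-XSub | acc. | 93.8 | 94.1 |
| human parsing | H3.6M | mIoU | 68.08 | 69.31 |
| human parsing | LIP | mIoU | 63.95 | 65.86 |
| human parsing | CIHP | mIoU | 70.58 | 72.33 |
| attribute recognition | PA-100k | mA | 82.85 | 84.36 |
| attribute recognition | RAPv2 | mA | 80.90 | 82.85 |
| image caption | CUHK-PEDES | B@4 | 31.1 | 31.6 |
| monocular 3D human pose and mesh recovery | 3DPW | MPVPE↓ | 79.8 | 77.4 |
| monocular 3D human pose and mesh recovery | 3DPW | MPJPE↓ | 67.0 | 66.3 |
| monocular 3D human pose and mesh recovery | 3DPW | PA-MPJPE↓ | 39.9 | 38.5 |
| monocular 3D human pose and mesh recovery | H3.6M | MPJPE↓ | 43.6 | 40.3 |
| monocular 3D human pose and mesh recovery | H3.6M | PA-MPJPE↓ | 31.9 | 28.8 |

Finetune Performance

| Task | Dataset | Metric | Hulk (ViT-B) | Hulk (ViT-L) |
|---|---|---|---|---|
| pedestrian detection | CrowdHuman | mAP | 92.4 | 93.0 |
| pedestrian detection | CrowdHuman | MR⁻² | 40.7 | 36.5 |
| pedestrian detection | CrowdHuman | JI | 86.0 | 87.0 |
| 2D pose | COCO | AP | 77.5 | 78.7 |
| 2D pose | AIC | AP | 35.6 | 37.1 |
| skeleton-based action | NTU60-XSub | acc. | 94.0 | 94.3 |
| human parsing | H3.6M | mIoU | 68.56 | 69.89 |
| human parsing | LIP | mIoU | 63.98 | 66.02 |
| human parsing | CIHP | mIoU | 71.26 | 72.68 |
| attribute recognition | PA-100k | mA | 87.85 | 88.97 |
| attribute recognition | RAPv2 | mA | 85.26 | 85.86 |
| image caption ♣ | CUHK-PEDES | B@4 | 28.3 | 30.5 |
| monocular 3D human pose and mesh recovery ♣ | 3DPW | MPVPE↓ | 80.7 | 79.9 |
| monocular 3D human pose and mesh recovery ♣ | 3DPW | MPJPE↓ | 68.9 | 68.3 |
| monocular 3D human pose and mesh recovery ♣ | 3DPW | PA-MPJPE↓ | 41.3 | 40.6 |
| monocular 3D human pose and mesh recovery ♣ | H3.6M | MPJPE↓ | 44.9 | 41.4 |
| monocular 3D human pose and mesh recovery ♣ | H3.6M | PA-MPJPE↓ | 32.0 | 30.2 |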

♣: For image caption and monocular 3D human pose and mesh recovery, the finetuned models perform worse than under direct evaluation, which suggests that overfitting may occur during finetuning.
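The ♣ note can be checked directly against the two tables. The sketch below copies the Hulk (ViT-B) numbers for the two flagged tasks and reports the signed change from finetuning. The helper name and the sign convention (negative means finetuning hurt) are our own. B@4 is treated as higher-is-better, and the metrics marked ↓ as lower-is-better.

```python
# (direct, finetune, higher_is_better) for Hulk (ViT-B), copied from the tables.
RESULTS = {
    ("CUHK-PEDES", "B@4"): (31.1, 28.3, True),
    ("3DPW", "MPVPE"): (79.8, 80.7, False),
    ("3DPW", "MPJPE"): (67.0, 68.9, False),
    ("3DPW", "PA-MPJPE"): (39.9, 41.3, False),
    ("H3.6M", "MPJPE"): (43.6, 44.9, False),
    ("H3.6M", "PA-MPJPE"): (31.9, 32.0, False),
}


def finetune_gain(direct: float, finetune: float, higher_is_better: bool) -> float:
    """Signed improvement from finetuning; negative means finetuning hurt."""
    delta = finetune - direct
    return round(delta if higher_is_better else -delta, 2)


if __name__ == "__main__":
    for (dataset, metric), values in RESULTS.items():
        print(f"{dataset:<12}{metric:<10}{finetune_gain(*values):+.1f}")
```

Every entry comes out negative for ViT-B, which matches the note.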

Contact

If you have any questions about our paper or code, feel free to contact Yizhou Wang and Yixuan Wu.

Citation

If you find this work useful, please consider citing:

@article{wang2023hulk,
  title={Hulk: A Universal Knowledge Translator for Human-Centric Tasks},
  author={Wang, Yizhou and Wu, Yixuan and Tang, Shixiang and He, Weizhen and Guo, Xun and Zhu, Feng and Bai, Lei and Zhao, Rui and Wu, Jian and He, Tong and others},
  journal={arXiv preprint arXiv:2312.01697},
  year={2023}
}
