We present HIPIE, a novel HIerarchical, oPen-vocabulary and unIvErsal image segmentation and detection model. It performs segmentation at several levels of granularity (whole, part, and subpart) and across tasks, including semantic segmentation, instance segmentation, panoptic segmentation, referring segmentation, and part/subpart segmentation, all within a unified framework of language-guided segmentation.
Hierarchical Open-vocabulary Universal Image Segmentation
Xudong Wang*, Shufan Li*, Konstantinos Kallidromitis*, Yusuke Kato, Kazuki Kozuka, Trevor Darrell
Berkeley AI Research, UC Berkeley; Panasonic AI Research
NeurIPS 2023
[project page] [arxiv] [paper] [bibtex]
Oct 5: We released more weights, along with code for training and evaluation.
Oct 15: We released additional ViT-H weights finetuned for part segmentation.
Please refer to INSTALL.md for more details.
HIPIE is also capable of labeling segmentation masks from SAM and can even identify additional masks that may have been overlooked by SAM.
Please check our project page for more demos!
We release the following checkpoints at the moment.
The following command trains the model on one node with 8 A100 GPUs:
python3 launch.py --nn 1 --np 8 --uni 1 --config-file projects/HIPIE/configs/<config file> MODEL.WEIGHTS <pretrained checkpoint>
The following command evaluates the model on one node with 8 A100 GPUs:
python3 launch.py --nn 1 --np 8 --uni 1 --config-file projects/HIPIE/configs/<config file> --eval-only MODEL.WEIGHTS <checkpoint to load>
Alternatively, one can run
python3 launch.py --nn 1 --np 8 --uni 1 --config-file projects/HIPIE/configs/<config file> --eval-only --resume OUTPUT_DIR <folder with checkpoints>
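When running several trainings or evaluations, it can help to assemble these command lines programmatically. The sketch below only builds the argument list and does not launch anything. It uses only the flags shown in the commands above; the helper name `build_launch_command`, the config name `my_config.yaml`, and the checkpoint path are hypothetical placeholders, and reading `--nn` as the node count and `--np` as GPUs per node is an inference from the "one node with 8 A100 GPUs" description, not something the repository documentation here confirms.

```python
import shlex


def build_launch_command(config_file, weights, eval_only=False,
                         num_nodes=1, gpus_per_node=8):
    """Build the launch.py argument list (hypothetical helper, not part of HIPIE).

    Assumption: --nn is the number of nodes and --np the GPUs per node.
    """
    cmd = [
        "python3", "launch.py",
        "--nn", str(num_nodes),
        "--np", str(gpus_per_node),
        "--uni", "1",
        "--config-file", f"projects/HIPIE/configs/{config_file}",
    ]
    if eval_only:
        cmd.append("--eval-only")
    # Trailing KEY VALUE pair, exactly as in the commands shown in this README.
    cmd += ["MODEL.WEIGHTS", weights]
    return cmd


if __name__ == "__main__":
    # Placeholder config and checkpoint names; substitute real ones.
    command = build_launch_command("my_config.yaml", "weights/model.pth",
                                   eval_only=True)
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` if desired; printing it first makes it easy to check against the commands above before committing GPU time.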
With the released weights, one should be able to reproduce the following results:
Data | COCO AP_bbox | COCO AP_segm | COCO MIoU | COCO PQ | ADE-150 AP_bbox | ADE-150 AP_segm | ADE-150 MIoU | ADE-150 PQ | RefCOCO oIoU | RefCOCO+ oIoU | RefCOCOg oIoU | PAS-21 MIoU | CTX-59 MIoU | CTX-459 MIoU | ADE-874 MIoU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
O365, COCO, RefCOCO/+/g, PACO | 60.4 | 51.1 | 65.6 | 57.0 | 23.0 | 19.1 | 24.3 | 21.0 | 81.5 | 71.5 | 74.3 | 81.1 | 57.4 | 14.4 | 9.7 |
O365*, COCO, RefCOCO/+/g | 61.3 | 51.9 | 66.8 | 58.1 | 18.4 | 14.9 | 28.4 | 20.1 | 82.8 | 73.9 | 75.7 | 83.2 | 58.1 | 11.1 | 10.8 |
* Used only in pretraining, not in final training.
Note on high variance: We observe that evaluation metrics can have high variance, likely because of noise introduced by the CLIP model. Specifically, changing MODEL.CLIP.ALPHA and MODEL.CLIP.BETA, which determine the importance of CLIP features relative to encoder features, can drastically change the results. It is possible to improve results on individual benchmarks by tuning these parameters.
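To give an intuition for why a single weighting parameter can shift results so much, here is a minimal sketch of one common way to combine two classifiers: a geometric blend of their class probabilities. This is an illustrative assumption, not HIPIE's confirmed formula; the exact roles of MODEL.CLIP.ALPHA and MODEL.CLIP.BETA are defined by the codebase and are not spelled out here. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_scores(encoder_probs, clip_probs, weight):
    """Geometric blend: encoder^(1 - weight) * clip^weight, renormalized.

    Assumption for illustration only; `weight` plays the role of an
    ALPHA/BETA-style parameter, with 0 = encoder only and 1 = CLIP only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    blended = [
        (e ** (1.0 - weight)) * (c ** weight)
        for e, c in zip(encoder_probs, clip_probs)
    ]
    total = sum(blended)
    return [b / total for b in blended]


if __name__ == "__main__":
    # Toy logits over three classes where the two models disagree.
    encoder_probs = softmax([2.0, 1.0, 0.1])
    clip_probs = softmax([0.5, 2.5, 0.2])
    for weight in (0.0, 0.25, 0.5, 0.75, 1.0):
        probs = blend_scores(encoder_probs, clip_probs, weight)
        print(weight, probs.index(max(probs)), [round(p, 3) for p in probs])
```

In this toy setting the predicted class flips somewhere between the two extremes of the weight, which is the kind of sensitivity that makes per-benchmark tuning worthwhile.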
The finetuned part segmentation model should produce the following results:
COCO | RefCOCO | Pascal-Parts |
---|---|---|
PQ | oIoU | MIoU-PartS |
55.3 | 78.1 | 64.4 |
The majority of HIPIE is licensed under the MIT license. If you later add other third party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.
If you have any general questions, feel free to email us at Xudong Wang, Shufan Li and Konstantinos Kallidromitis. If you have code or implementation-related questions, please feel free to send emails to us or open an issue in this codebase (We recommend that you open an issue in this codebase, because your questions may help others).
If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.
@inproceedings{wang2023hierarchical,
title={Hierarchical Open-vocabulary Universal Image Segmentation},
author={Wang, Xudong and Li, Shufan and Kallidromitis, Konstantinos and Kato, Yusuke and Kozuka, Kazuki and Darrell, Trevor},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}