Code for our CVPR 2023 paper "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models".
Contributed by Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He.
Install the dependencies.
pip install -r requirements.txt
The HICO-DET dataset can be downloaded here. After downloading, unpack the tarball (hico_20160224_det.tar.gz) into the data directory.
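If you prefer to do the unpacking from Python instead of the shell, a minimal sketch (assuming the tarball was downloaded to the repository root; adjust the path otherwise) is:

```python
import tarfile
from pathlib import Path

# Assumed location of the downloaded tarball; the target directory matches
# the layout described in this README (data/hico_20160224_det/...).
tarball = Path("hico_20160224_det.tar.gz")
target = Path("data")
target.mkdir(exist_ok=True)

with tarfile.open(tarball, "r:gz") as tar:
    tar.extractall(path=target)

print("Extracted to", target / "hico_20160224_det")
```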
Instead of the original annotation files, we use the annotation files provided by the PPDM authors. They can be downloaded from here and have to be placed as follows. For the fractional data setting, we provide the annotations here. After decompressing, place the files under data/hico_20160224_det/annotations.
data
└─ hico_20160224_det
|─ annotations
| |─ trainval_hico.json
| |─ test_hico.json
| |─ corre_hico.json
| |─ trainval_hico_5%.json
| |─ trainval_hico_15%.json
| |─ trainval_hico_25%.json
| └─ trainval_hico_50%.json
:
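To sanity-check the layout, the following snippet verifies that every annotation file listed in the tree above is in place (file names are taken from the tree; nothing else is assumed):

```python
from pathlib import Path

# Annotation files as listed in the directory tree above.
ann_dir = Path("data/hico_20160224_det/annotations")
expected = [
    "trainval_hico.json",
    "test_hico.json",
    "corre_hico.json",
    "trainval_hico_5%.json",
    "trainval_hico_15%.json",
    "trainval_hico_25%.json",
    "trainval_hico_50%.json",
]

missing = [name for name in expected if not (ann_dir / name).exists()]
print("All annotation files found." if not missing else f"Missing files: {missing}")
```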
First clone the V-COCO repository from here, then follow its instructions to generate the file instances_vcoco_all_2014.json. Next, download the prior file prior.pickle from here. Place the files and make the directories as follows.
HOICLIP
|─ data
│ └─ v-coco
| |─ data
| | |─ instances_vcoco_all_2014.json
| | :
| |─ prior.pickle
| |─ images
| | |─ train2014
| | | |─ COCO_train2014_000000000009.jpg
| | | :
| | └─ val2014
| | |─ COCO_val2014_000000000042.jpg
| | :
| |─ annotations
: :
For our implementation, the annotation files have to be converted to the HOIA format. The conversion can be done as follows.
PYTHONPATH=data/v-coco \
python convert_vcoco_annotations.py \
--load_path data/v-coco/data \
--prior_path data/v-coco/prior.pickle \
--save_path data/v-coco/annotations
Note that only Python 2 can be used for this conversion because vsrl_utils.py in the v-coco repository raises an error with Python 3.
V-COCO annotations in the HOIA format, corre_vcoco.npy, test_vcoco.json, and trainval_vcoco.json, will be generated in the annotations directory.
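A quick way to confirm that the conversion succeeded is to load one of the generated files and report its size; this sketch only assumes the output is valid JSON and does not rely on the exact record structure:

```python
import json
from pathlib import Path

# One of the files produced by convert_vcoco_annotations.py above.
ann_path = Path("data/v-coco/annotations/trainval_vcoco.json")
with ann_path.open() as f:
    annotations = json.load(f)

# HOIA-style files are typically a list of per-image records, but we only
# report the top-level size here without assuming a specific schema.
print(f"Loaded {ann_path.name}: {len(annotations)} top-level entries")
```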
Download the pretrained model of the DETR detector for ResNet50 and put it into the params directory.
python ./tools/convert_parameters.py \
--load_path params/detr-r50-e632da11.pth \
--save_path params/detr-r50-pre-2branch-hico.pth \
--num_queries 64
python ./tools/convert_parameters.py \
--load_path params/detr-r50-e632da11.pth \
--save_path params/detr-r50-pre-2branch-vcoco.pth \
--dataset vcoco \
--num_queries 64
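To double-check a converted checkpoint, you can load it and print the shapes of the query-embedding weights; this is only a sketch and assumes the weights are stored under a 'model' key, as in the official DETR checkpoints (it falls back to the top-level dict otherwise):

```python
import torch

# Checkpoint produced by the conversion command above.
ckpt_path = "params/detr-r50-pre-2branch-hico.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Assumption: weights live under 'model'; fall back to the raw dict otherwise.
state_dict = checkpoint["model"] if isinstance(checkpoint, dict) and "model" in checkpoint else checkpoint

for name, tensor in state_dict.items():
    if "query_embed" in name:
        # The first dimension should match --num_queries (64 above).
        print(name, tuple(tensor.shape))
```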
After the preparation, you can start training with the following commands.
# default setting
sh ./scripts/train_hico.sh
sh ./scripts/train_vcoco.sh
# rare first unseen combination setting
sh ./scripts/train_hico_rf_uc.sh
# non rare first unseen combination setting
sh ./scripts/train_hico_nrf_uc.sh
# unseen object setting
sh ./scripts/train_hico_uo.sh
# unseen verb setting
sh ./scripts/train_hico_uv.sh
# 50% fractional data
sh ./scripts/train_hico_frac.sh
sh ./scripts/generate_verb.sh
We provide the generated verb representations in ./tmp/verb.pth for HICO-DET and ./tmp/vcoco_verb.pth for V-COCO.
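If you want to inspect a provided verb representation, a minimal sketch (which assumes nothing about the stored object beyond it being loadable with torch.load) is:

```python
import torch

# Verb representation for HICO-DET; use ./tmp/vcoco_verb.pth for V-COCO.
verb_repr = torch.load("./tmp/verb.pth", map_location="cpu")

# Report the shape if it is a tensor, otherwise describe the container.
if torch.is_tensor(verb_repr):
    print("Tensor of shape", tuple(verb_repr.shape))
elif isinstance(verb_repr, dict):
    print("Dict with keys:", list(verb_repr.keys()))
else:
    print("Loaded object of type", type(verb_repr).__name__)
```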
You can conduct the evaluation with trained parameters for HICO-DET as follows.
python -m torch.distributed.launch \
--nproc_per_node=2 \
--use_env \
main.py \
--pretrained [path to your checkpoint] \
--dataset_file hico \
--hoi_path data/hico_20160224_det \
--num_obj_classes 80 \
--num_verb_classes 117 \
--backbone resnet50 \
--num_queries 64 \
--dec_layers 3 \
--eval \
--zero_shot_type default \
--with_clip_label \
--with_obj_clip_label \
--use_nms_filter
For the official evaluation (reported in the paper), you need to convert the prediction file to the official prediction format following this file, and then follow the PPDM evaluation steps.
For the zero-shot settings, pass the corresponding --zero_shot_type (rare_first in the example below) together with --del_unseen:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--use_env \
main.py \
--pretrained [path to your checkpoint] \
--dataset_file hico \
--hoi_path data/hico_20160224_det \
--num_obj_classes 80 \
--num_verb_classes 117 \
--backbone resnet50 \
--num_queries 64 \
--dec_layers 3 \
--eval \
--with_clip_label \
--with_obj_clip_label \
--use_nms_filter \
--zero_shot_type rare_first \
--del_unseen
Training-Free Enhancement is applied when args.training_free_enhancement_path is not empty. The results are saved under args.output_dir/args.training_free_enhancement_path. You may refer to the code at engine.py:202. By default, we set the topk to [10, 20, 30, 40, 50].
The visualization script is scripts/visualization_hico.sh. You may need to adjust the file paths marked with TODO comments in visualization_hoiclip/gen_vlkt.py; currently the code visualizes failure cases in some zero-shot settings. For more details, refer to the comments.
|         | Full (D) | Rare (D) | Non-rare (D) | Full (KO) | Rare (KO) | Non-rare (KO) | Download | Config |
|---------|----------|----------|--------------|-----------|-----------|---------------|----------|--------|
| HOICLIP | 34.69    | 31.12    | 35.74        | 37.61     | 34.47     | 38.54         | model    | config |
D: Default, KO: Known Object. The best result is achieved with training-free enhancement (topk=10).
|         | Fractional | Full  | Rare  | Non-rare | Config |
|---------|------------|-------|-------|----------|--------|
| HOICLIP | 5%         | 22.64 | 21.94 | 24.28    | config |
| HOICLIP | 15%        | 27.07 | 24.59 | 29.38    | config |
| HOICLIP | 25%        | 28.44 | 25.47 | 30.52    | config |
| HOICLIP | 50%        | 30.88 | 26.05 | 32.97    | config |
You may need to change the --frac [portion]% option in the scripts.
|         | Scenario 1 | Scenario 2 | Download | Config |
|---------|------------|------------|----------|--------|
| HOICLIP | 63.50      | 64.81      | model    | config |
|         | Type  | Unseen | Seen  | Full  | Download | Config |
|---------|-------|--------|-------|-------|----------|--------|
| HOICLIP | RF-UC | 25.53  | 34.85 | 32.99 | model    | config |
| HOICLIP | NF-UC | 26.39  | 28.10 | 27.75 | model    | config |
| HOICLIP | UO    | 16.20  | 30.99 | 28.53 | model    | config |
| HOICLIP | UV    | 24.30  | 32.19 | 31.09 | model    | config |
We also provide the checkpoints for the uc0, uc1, uc2, and uc3 settings on Google Drive.
Please consider citing our paper if it helps your research.
@inproceedings{ning2023hoiclip,
title={HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models},
author={Ning, Shan and Qiu, Longtian and Liu, Yongfei and He, Xuming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={23507--23517},
year={2023}
}
Our code is built upon GEN-VLKT, PPDM, DETR, QPIC and CDN. We thank them for their contributions.