
HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Code for our CVPR 2023 paper "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models".

Contributed by Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He.

Installation

Install the dependencies.

pip install -r requirements.txt
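
If you want an isolated environment first, a minimal sketch (the environment name is illustrative; requirements.txt is from this repository):

# create and activate a virtual environment, then install the dependencies
python -m venv hoiclip-env        # hypothetical environment name
source hoiclip-env/bin/activate
pip install -r requirements.txt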

Data preparation

HICO-DET

The HICO-DET dataset can be downloaded here. After downloading, unpack the tarball (hico_20160224_det.tar.gz) into the data directory.

Instead of using the original annotation files, we use the annotation files provided by the PPDM authors, which can be downloaded from here. For the fractional data setting, we provide the annotations here. After decompression, the files should be placed under data/hico_20160224_det/annotations, following the layout below (a shell sketch of these steps follows the tree).

data
 └─ hico_20160224_det
     |─ annotations
     |   |─ trainval_hico.json
     |   |─ test_hico.json
     |   |─ corre_hico.json
     |   |─ trainval_hico_5%.json
     |   |─ trainval_hico_15%.json
     |   |─ trainval_hico_25%.json
     |   └─ trainval_hico_50%.json
     :
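
A shell sketch of the placement steps above, assuming the tarball and the annotation archives have already been downloaded into the working directory (the annotation archive names are hypothetical):

# unpack the HICO-DET images and original annotations
mkdir -p data
tar -xzf hico_20160224_det.tar.gz -C data

# place the PPDM and fractional annotation files per the tree above
mkdir -p data/hico_20160224_det/annotations
tar -xzf ppdm_annotations.tar.gz -C data/hico_20160224_det/annotations          # hypothetical archive name
tar -xzf fractional_annotations.tar.gz -C data/hico_20160224_det/annotations    # hypothetical archive name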

V-COCO

First clone the V-COCO repository from here, then follow its instructions to generate the file instances_vcoco_all_2014.json. Next, download the prior file prior.pickle from here. Place the files and create the directories as follows (a shell sketch follows the tree).

HOICLIP
 |─ data
 |   └─ v-coco
 |       |─ data
 |       |   |─ instances_vcoco_all_2014.json
 |       |   :
 |       |─ prior.pickle
 |       |─ images
 |       |   |─ train2014
 |       |   |   |─ COCO_train2014_000000000009.jpg
 |       |   |   :
 |       |   └─ val2014
 |       |       |─ COCO_val2014_000000000042.jpg
 |       |       :
 |       |─ annotations
 :       :
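
A sketch of these steps, assuming the canonical s-gupta/v-coco repository (generating instances_vcoco_all_2014.json follows that repository's README; downloading prior.pickle and the COCO images is elided):

# clone V-COCO (with its submodules) into the layout above
git clone --recursive https://github.com/s-gupta/v-coco.git data/v-coco
# after following the v-coco README, instances_vcoco_all_2014.json should sit in
# data/v-coco/data, prior.pickle in data/v-coco, and the COCO images in
# data/v-coco/images/{train2014,val2014}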

For our implementation, the annotation files have to be converted to the HOIA format. The conversion can be done as follows.

PYTHONPATH=data/v-coco \
        python convert_vcoco_annotations.py \
        --load_path data/v-coco/data \
        --prior_path data/v-coco/prior.pickle \
        --save_path data/v-coco/annotations

Note that only Python 2 can be used for this conversion, because vsrl_utils.py in the v-coco repository raises an error under Python 3.

The V-COCO annotations in HOIA format, corre_vcoco.npy, test_vcoco.json, and trainval_vcoco.json, will be generated in the annotations directory.
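
A quick check that the conversion succeeded (paths follow the command above):

ls data/v-coco/annotations
# expected: corre_vcoco.npy  test_vcoco.json  trainval_vcoco.json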

Pre-trained model

Download the pretrained DETR detector checkpoint for ResNet-50 (a download sketch follows) and put it into the params directory, then convert the parameters as follows.
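
A sketch of fetching the checkpoint; the URL below is the official DETR release matching the filename used in the conversion commands, but verify it against the DETR model zoo:

# download the DETR ResNet-50 checkpoint into params/
mkdir -p params
wget -P params https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth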

python ./tools/convert_parameters.py \
        --load_path params/detr-r50-e632da11.pth \
        --save_path params/detr-r50-pre-2branch-hico.pth \
        --num_queries 64

python ./tools/convert_parameters.py \
        --load_path params/detr-r50-e632da11.pth \
        --save_path params/detr-r50-pre-2branch-vcoco.pth \
        --dataset vcoco \
        --num_queries 64

Training

After the preparation, you can start training with the following commands.

HICO-DET

# default setting
sh ./scripts/train_hico.sh
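
For reference, roughly what scripts/train_hico.sh launches: a sketch inferred from the evaluation command later in this README (minus --eval), starting from the converted DETR weights. The actual script and its hyperparameters may differ.

# --output_dir is an assumed flag name; all other flags appear in the
# evaluation command in the Evaluation section below
python -m torch.distributed.launch \
        --nproc_per_node=2 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2branch-hico.pth \
        --output_dir logs/hico \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers 3 \
        --with_clip_label \
        --with_obj_clip_label \
        --use_nms_filter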

V-COCO

sh ./scripts/train_vcoco.sh

Zero-shot

# rare-first unseen combination (RF-UC) setting
sh ./scripts/train_hico_rf_uc.sh
# non-rare-first unseen combination (NF-UC) setting
sh ./scripts/train_hico_nrf_uc.sh
# unseen object (UO) setting
sh ./scripts/train_hico_uo.sh
# unseen verb (UV) setting
sh ./scripts/train_hico_uv.sh

Fractional data

# 50% fractional data
sh ./scripts/train_hico_frac.sh
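
To train on a different portion, edit the --frac flag in the script (see also the note under the fractional results table). For example, assuming the script currently passes --frac 50%:

# switch the script to the 25% split (assumes the flag reads exactly "--frac 50%")
sed -i 's/--frac 50%/--frac 25%/' scripts/train_hico_frac.sh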

Generate verb representation for Visual Semantic Arithmetic

sh ./scripts/generate_verb.sh

We provide the generated verb representations in ./tmp/verb.pth for HICO-DET and ./tmp/vcoco_verb.pth for V-COCO.
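
A quick sanity check that a provided representation loads (the tensor layout is not documented here, so only the object type is printed):

python -c "import torch; v = torch.load('./tmp/verb.pth', map_location='cpu'); print(type(v))"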

Evaluation

HICO-DET

You can conduct the evaluation with trained parameters for HICO-DET as follows.

python -m torch.distributed.launch \
        --nproc_per_node=2 \
        --use_env \
        main.py \
        --pretrained [path to your checkpoint] \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers 3 \
        --eval \
        --zero_shot_type default \
        --with_clip_label \
        --with_obj_clip_label \
        --use_nms_filter

For the official evaluation (reported in the paper), you need to convert the prediction file to the official prediction format following this file, and then follow the PPDM evaluation steps.


Zero-shot

python -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained [path to your checkpoint] \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers 3 \
        --eval \
        --with_clip_label \
        --with_obj_clip_label \
        --use_nms_filter \
        --zero_shot_type rare_first \
        --del_unseen

Training-Free Enhancement

The training-free enhancement is applied when args.training_free_enhancement_path is not empty; the results are written to args.output_dir/args.training_free_enhancement_path. See the code at engine.py:202. By default, topk is set to [10, 20, 30, 40, 50].
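
A sketch of turning it on during HICO-DET evaluation, assuming the CLI flags mirror args.output_dir and args.training_free_enhancement_path (the subdirectory name tfe is arbitrary):

python -m torch.distributed.launch \
        --nproc_per_node=2 \
        --use_env \
        main.py \
        --pretrained [path to your checkpoint] \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers 3 \
        --eval \
        --zero_shot_type default \
        --with_clip_label \
        --with_obj_clip_label \
        --use_nms_filter \
        --output_dir logs/hico \
        --training_free_enhancement_path tfe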

Visualization

The visualization script is scripts/visualization_hico.sh. You may need to adjust the file paths marked with TODO comments in visualization_hoiclip/gen_vlkt.py; currently, the code visualizes failure cases in some zero-shot settings. For details, refer to the comments in the code.

Regular HOI Detection Results

HICO-DET

Full (D) Rare (D) Non-rare (D) Full (KO) Rare (KO) Non-rare (KO) Download Config
HOICLIP 34.69 31.12 35.74 37.61 34.47 38.54 model config

D: Default, KO: Known Object. The best result is achieved with the training-free enhancement (topk=10).

HICO-DET Fractional Setting

Fractional Full Rare Non-rare Config
HOICLIP 5% 22.64 21.94 24.28 config
HOICLIP 15% 27.07 24.59 29.38 config
HOICLIP 25% 28.44 25.47 30.52 config
HOICLIP 50% 30.88 26.05 32.97 config

You may need to change --frac [portion]% in the scripts accordingly.

V-COCO

Scenario 1 Scenario 2 Download Config
HOICLIP 63.50 64.81 model config

Zero-shot HOI Detection Results

Type Unseen Seen Full Download Config
HOICLIP RF-UC 25.53 34.85 32.99 model config
HOICLIP NF-UC 26.39 28.10 27.75 model config
HOICLIP UO 16.20 30.99 28.53 model config
HOICLIP UV 24.30 32.19 31.09 model config

We also provide checkpoints for the uc0, uc1, uc2, and uc3 settings in Google Drive.

Citation

Please consider citing our paper if it helps your research.

@inproceedings{ning2023hoiclip,
  title={HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models},
  author={Ning, Shan and Qiu, Longtian and Liu, Yongfei and He, Xuming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23507--23517},
  year={2023}
}

Acknowledgements

Our code is built on GEN-VLKT, PPDM, DETR, QPIC, and CDN. We thank the authors for their contributions.

Release Schedule