This repository contains the official PyTorch implementation for the paper
Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould; Exploring Predicate Visual Context for Detecting Human-Object Interactions; In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 10411-10421.
Recently, the DETR framework has emerged as the dominant approach for human–object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
# Say you are using Conda
conda create --name pvic python=3.8
conda activate pvic
# Required dependencies
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install matplotlib==3.6.3 scipy==1.10.0 tqdm==4.64.1
pip install numpy==1.24.1 timm==0.6.12
pip install wandb==0.13.9 seaborn==0.13.0
# Clone the repo and submodules
git clone https://github.com/fredzzhang/pvic.git
cd pvic
git submodule init
git submodule update
pip install -e pocket
# Build CUDA operator for MultiScaleDeformableAttention
cd h_detr/models/ops
python setup.py build install
cd /path/to/pvic/hicodet
bash download.sh
cd /path/to/pvic/hicodet
ln -s /path/to/hicodet_20160224_det ./hico_20160224_det
cd /path/to/pvic/vcoco
bash download.sh
cd /path/to/pvic/vcoco
ln -s /path/to/coco ./mscoco2014
Visualisation utilities are implemented to run inference on a single image and visualise the cross-attention weights. A reference model is provided for demonstration purpose if you don't want to train a model yourself. Download the model and save it to ./checkpoints/
. Use the argument --index
to select images and --action
to specify the action index. Refer to the lookup table for action indices.
DETR=base python inference.py --resume checkpoints/pvic-detr-r50-hicodet.pth --index 4050 --action 111
The detected human-object pairs with scores overlayed are saved to fig.png
, while the attention weights are saved to pair_xx_attn_head_x.png
. Below are some sample outputs.
In addition, the argument --image-path
enables inference on custom images.
Refer to the documentation for model checkpoints and training/testing commands.
PViC is released under the BSD-3-Clause License.
If you find our work useful for your research, please consider citing us:
@inproceedings{zhang2023pvic,
author = {Zhang, Frederic Z. and Yuan, Yuhui and Campbell, Dylan and Zhong, Zhuoyao and Gould, Stephen},
title = {Exploring Predicate Visual Context in Detecting Human–Object Interactions},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {10411-10421},
}
@inproceedings{zhang2022upt,
author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
title = {Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {20104-20112}
}
@inproceedings{zhang2021scg,
author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {13319-13327}
}