SHTUPLUS / Pix2Grp_CVPR2024


Official Implementation of "From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models"

Table of Contents

- Introduction
- Installation
- Datasets
- Model Zoo
- Training and Evaluation
- Paper and Citing
- Acknowledgments
- License

Introduction

Our paper "From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models" has been accepted to CVPR 2024.

Installation

  1. Create a conda environment and install PyTorch:
conda create -n pix2sgg python=3.8
conda activate pix2sgg

# CUDA 11.8
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# or CUDA 10.2
conda install pytorch==1.10.2 torchvision==0.11.3 cudatoolkit=10.2 -c pytorch
  2. Install other dependencies:

    pip install -r requirements_pix2sgg.txt
    # Hugging Face transformers version: v4.29.2

    Our work is built upon LAVIS and shares most of its requirements.

  3. Build the project (a quick sanity check is sketched after this list):

    python setup.py build develop
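
After these steps, a quick check can confirm that PyTorch sees the GPU and that the in-tree LAVIS package imports. This is a minimal sketch; `load_model_and_preprocess` is the standard LAVIS entry point and is assumed to be unchanged in this fork:

```python
# Sanity check for the pix2sgg environment (run inside `conda activate pix2sgg`).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# This import should succeed once `python setup.py build develop` has run.
from lavis.models import load_model_and_preprocess  # noqa: F401
print("LAVIS import OK")
```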

Datasets

Check DATASET.md for dataset preprocessing instructions.

Model Zoo

Open Vocabulary SGG

The model weights can be downloaded from: https://huggingface.co/rj979797/PGSG-CVPR2024/tree/main
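
For scripted downloads, the checkpoints can also be fetched with huggingface_hub (a sketch, assuming the library is installed; downloading the files manually from the page above works just as well):

```python
# Fetch a PGSG checkpoint from the Hugging Face repo listed above.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="rj979797/PGSG-CVPR2024",
    filename="vg_ov_sgg.pth",  # any checkpoint name from the tables below
)
print("checkpoint saved to:", ckpt_path)
```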

| Datasets | Novel+Base mR@50/100 | Novel+Base R@50/100 | Novel mR@50/100 | Checkpoint |
|---|---|---|---|---|
| VG | 6.2/8.3 | 15.1/18.4 | 3.7/5.2 | vg_ov_sgg.pth |
| VG-SGCls | 9.7/13.8 | 26.8/33.2 | 5.1/7.7 | vg_ov_sgg.pth |
| PSG | 15.3/17.7 | 23.7/25.4 | 6.7/9.6 | psg_ov_sgg.pth |

Closed Vocabulary SGG

| Datasets | mR@50/100 | R@50/100 | Checkpoint |
|---|---|---|---|
| VG | 9.0/11.5 | 17.7/20.7 | vg_sgg.pth |
| PSG | 14.5/17.6 | 25.8/28.9 | psg_sgg.pth |
| VG-c | 10.4/12.7 | 20.3/23.6 | vg_sgg_close_clser.pth |
| PSG-c | 21.2/22.0 | 34.9/36.1 | psg_sgg_close_clser.pth |
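
A downloaded checkpoint can be inspected offline before it is wired into a config. This is a minimal sketch; the top-level layout of the saved dict is an assumption, so the keys printed may vary:

```python
# Inspect a checkpoint on CPU without building the model.
import torch

state = torch.load("vg_sgg.pth", map_location="cpu")
if isinstance(state, dict):
    # Checkpoints often nest weights under a key such as "model"; list what is there.
    print("top-level keys:", list(state.keys())[:10])
else:
    print("loaded object of type:", type(state))
```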

Training and Evaluation

Our PGSG is trained using the BLIP pre-trained weights, accessible here.

Before training or evaluation, make sure the checkpoint paths in the configuration file (*.yaml) are correct: training loads the checkpoint specified by model.pretrained, while evaluation loads the checkpoint specified by model.finetuned.
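
As a concrete check, both keys can be read back from a config before launching a long run (a minimal sketch; it assumes only PyYAML and the model.pretrained / model.finetuned layout described above):

```python
# Print the checkpoint paths a config will load, before starting a job.
import yaml

cfg_path = "lavis/projects/blip/train/vrd_vg_ft_pgsg_ov.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg.get("model", {})
print("model.pretrained (loaded for training):  ", model_cfg.get("pretrained"))
print("model.finetuned  (loaded for evaluation):", model_cfg.get("finetuned"))
```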

VG dataset

Open Vocabulary SGG

Training

python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg_ov.yaml --job-name VG-pgsg_ovsgg

Evaluation

python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval_ov.yaml --job-name VG-pgsg_ovsgg-eval

Standard SGG

Training

python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg.yaml --job-name VG-pgsg_stdsgg

Evaluation

python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval.yaml --job-name VG-pgsg_stdsgg-eval 

PSG dataset

Open Vocabulary SGG

Training

python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg_ov.yaml --job-name psg-pgsg_ovsgg

Evaluation

python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_ov.yaml --job-name psg-pgsg_ovsgg-eval 

Standard SGG

Training

python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg.yaml --job-name psg-pgsg_stdsgg

Evaluation

python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_eval.yaml --job-name psg-pgsg_stdsgg-eval 

Paper and Citing

If you find this project helpful for your research, please consider citing our paper:

@misc{li2024pixels,
    title={From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models},
    author={Rongjie Li and Songyang Zhang and Dahua Lin and Kai Chen and Xuming He},
    year={2024},
    eprint={2404.00906},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgments

This repository is built on LAVIS and borrows code from the scene graph benchmarking framework SGTR.

License

BSD 3-Clause License