Our paper "From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models" has been accepted by CVPR 2024.
Installation

```bash
conda create -n pix2sgg python=3.8
conda activate pix2sgg

# CUDA 11.8
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# or CUDA 10.2
conda install pytorch==1.10.2 torchvision==0.11.3 cudatoolkit=10.2 -c pytorch
```
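A quick sanity check (plain PyTorch, nothing project-specific) confirms that the CUDA build is active:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```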
Install other dependencies:

```bash
pip install -r requirements_pix2sgg.txt
# Hugging Face version: v4.29.2
```
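The comment above refers, in our reading, to the Hugging Face `transformers` package; if pip resolves a different release, it can be pinned explicitly:

```bash
pip install transformers==4.29.2
```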
Our work is built upon LAVIS and shares most of its requirements.
Build Project

```bash
python setup.py build develop
```
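As a smoke test, assuming the project installs under the `lavis` package name as in upstream LAVIS:

```bash
python -c "import lavis; print(lavis.__file__)"
```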
Dataset

See DATASET.md for dataset preprocessing instructions.
Checkpoints

The model weights can be downloaded from: https://huggingface.co/rj979797/PGSG-CVPR2024/tree/main
Open-vocabulary SGG:

| Datasets | Novel+base mR50/100 | Novel+base R50/100 | Novel mR50/100 | Checkpoint |
|---|---|---|---|---|
| VG | 6.2/8.3 | 15.1/18.4 | 3.7/5.2 | vg_ov_sgg.pth |
| VG-SGCls | 9.7/13.8 | 26.8/33.2 | 5.1/7.7 | vg_ov_sgg.pth |
| PSG | 15.3/17.7 | 23.7/25.4 | 6.7/9.6 | psg_ov_sgg.pth |
Standard SGG:

| Datasets | mR50/100 | R50/100 | Checkpoint |
|---|---|---|---|
| VG | 9.0/11.5 | 17.7/20.7 | vg_sgg.pth |
| PSG | 14.5/17.6 | 25.8/28.9 | psg_sgg.pth |
| VG-c | 10.4/12.7 | 20.3/23.6 | vg_sgg_close_clser.pth |
| PSG-c | 21.2/22.0 | 34.9/36.1 | psg_sgg_close_clser.pth |
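One way to fetch a checkpoint from the repository above, assuming a recent `huggingface_hub` that ships the `huggingface-cli download` command (the `checkpoints/` directory is just a convention for this example):

```bash
mkdir -p checkpoints
huggingface-cli download rj979797/PGSG-CVPR2024 vg_ov_sgg.pth --local-dir checkpoints
```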
Training and Evaluation

Our PGSG is trained from the BLIP pre-trained weights, accessible here.
Before training or evaluation, make sure the checkpoint paths in the configuration file (*.yaml) are correct: training loads the checkpoint specified by `model.pretrained`, while evaluation loads the one specified by `model.finetuned`.
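For illustration, a hypothetical config fragment (the field names follow the `model.pretrained`/`model.finetuned` convention noted above; the paths are placeholders, not the actual defaults):

```yaml
model:
  # checkpoint loaded at training time
  pretrained: "path/to/blip_pretrained.pth"
  # checkpoint loaded at evaluation time
  finetuned: "path/to/vg_ov_sgg.pth"
```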
Open-vocabulary SGG on VG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg_ov.yaml --job-name VG-pgsg_ovsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval_ov.yaml --job-name VG-pgsg_ovsgg-eval
```
Standard SGG on VG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg.yaml --job-name VG-pgsg_stdsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval.yaml --job-name VG-pgsg_stdsgg-eval
```
Open-vocabulary SGG on PSG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg_ov.yaml --job-name psg-pgsg_ovsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_ov.yaml --job-name psg-pgsg_ovsgg-eval
```
Standard SGG on PSG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg.yaml --job-name psg-pgsg_stdsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_eval.yaml --job-name psg-pgsg_stdsgg-eval
```
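The commands above assume 4 GPUs. To run on a different number, adjust `--nproc_per_node` (and choose any free `--master_port`); for example, a single-GPU evaluation:

```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=1 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_eval.yaml --job-name psg-pgsg_stdsgg-eval
```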
Citation

If you find this project helpful for your research, please consider citing our paper:
```bibtex
@misc{li2024pixels,
    title={From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models},
    author={Rongjie Li and Songyang Zhang and Dahua Lin and Kai Chen and Xuming He},
    year={2024},
    eprint={2404.00906},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
Acknowledgement

This repository is built on LAVIS and borrows code from the scene graph benchmarking framework SGTR.