This is the official repository for the paper "Mono3DVG: 3D Visual Grounding in Monocular Images". [AAAI paper] [ArXiv paper] [AAAI Video/Poster]
The paper has been accepted by AAAI 2024 🎉.
School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University
We introduce Mono3DVG, a novel task of 3D visual grounding in monocular RGB images that aims to localize the true 3D extent of a referred object using language descriptions with appearance and geometry information.
We build the first dataset for Mono3DVG, termed Mono3DRefer, which can be downloaded from our Google Drive:
https://drive.google.com/drive/folders/1ICBv0SRbRIUnl_z8DVuH8lz7KQt580EI?usp=drive_link
Mono3DVG-TR is the first end-to-end transformer-based network for monocular 3D visual grounding.
You can follow the environment setup of MonoDETR.
pip install -r requirements.txt
cd lib/models/mono3dvg/ops/
bash make.sh
cd ../../../..
Mono3DVG/
├──Mono3DRefer/
│ ├──images/
│ │ ├──000000.png
│ │ ├──...
│ ├──calib/
│ │ ├──000000.txt
│ │ ├──...
│ ├──Mono3DRefer_train_image.txt
│ ├──Mono3DRefer_val_image.txt
│ ├──Mono3DRefer_test_image.txt
│ ├──Mono3DRefer.json
│ ├──test_instanceID_split.json
├──configs
│ ├──mono3dvg.yaml
│ ├──checkpoint_best_MonoDETR.pth
├──lib
│ ├──datasets/
│ │ ├──...
│ ├──helpers/
│ │ ├──...
│ ├──losses/
│ │ ├──...
│ ├──models/
│ │ ├──...
├──roberta-base
│ ├──...
├──utils
│ ├──...
├──outputs # save_path
│ ├──mono3dvg
│ │ ├──...
├──test.py
├──train.py
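The annotation file `Mono3DRefer.json` pairs each referring expression with its image. As a rough sketch of how it could be consumed (the field names `im_name`, `instanceID`, and `description` here are illustrative assumptions, not the dataset's documented schema), grouping expressions by image for a given split might look like:

```python
import json
import tempfile
from collections import defaultdict

def load_annotations(json_path, split_ids):
    """Group referring expressions by image, keeping only the given instance IDs.

    NOTE: the field names ('im_name', 'instanceID', 'description') are
    illustrative assumptions about the schema, not taken from the repo.
    """
    with open(json_path) as f:
        records = json.load(f)
    by_image = defaultdict(list)
    for rec in records:
        if rec["instanceID"] in split_ids:
            by_image[rec["im_name"]].append(rec["description"])
    return dict(by_image)

# Tiny synthetic file in the assumed format:
sample = [
    {"im_name": "000000.png", "instanceID": 0, "description": "the red car ahead"},
    {"im_name": "000000.png", "instanceID": 1, "description": "the cyclist on the left"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp_path = f.name

train_anns = load_annotations(tmp_path, split_ids={0})
```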
You can also change the dataset path via "root_dir" in configs/mono3dvg.yaml.
You can also change the save path via "save_path" in configs/mono3dvg.yaml.
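The per-image `calib/*.txt` files follow the KITTI convention, where the `P2` entry holds the 3x4 projection matrix of the left color camera. As a minimal, dependency-free sketch (the calib text below is a made-up example, not taken from the dataset), projecting a camera-frame 3D point into the image looks like:

```python
def parse_p2(calib_text):
    """Extract the 3x4 P2 projection matrix from KITTI-style calib text."""
    for line in calib_text.splitlines():
        if line.startswith("P2:"):
            vals = [float(v) for v in line.split()[1:]]
            return [vals[0:4], vals[4:8], vals[8:12]]
    raise ValueError("no P2 entry found")

def project(p2, xyz):
    """Project a 3D camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = xyz
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in p2)
    return u / w, v / w

# Synthetic calib entry with an assumed focal length / principal point:
calib = "P2: 700.0 0.0 600.0 0.0 0.0 700.0 180.0 0.0 0.0 0.0 1.0 0.0"
p2 = parse_p2(calib)
u, v = project(p2, (2.0, 1.5, 10.0))  # point 10 m in front of the camera
```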
You must download the pre-trained RoBERTa and MonoDETR models.
You can download the checkpoint we provide to evaluate the Mono3DVG-TR model.
| Models | Links | File Path | File Name |
| --- | --- | --- | --- |
| RoBERTa | model | `roberta-base/` | `pytorch_model.bin` |
| Pre-trained model (MonoDETR) | model | `configs/` | `checkpoint_best_MonoDETR.pth` |
| Best checkpoint (Mono3DVG-TR) | model | `outputs/mono3dvg/` | `checkpoint_best.pth` |
You can modify the GPU, model, and training settings in configs/mono3dvg.yaml.
CUDA_VISIBLE_DEVICES=1 python train.py
The best checkpoint is evaluated by default. You can change it via "pretrain_model: 'checkpoint_best.pth'" in configs/mono3dvg.yaml:
CUDA_VISIBLE_DEVICES=1 python test.py
Fig.1 Blue, green, and red boxes denote the ground truth, prediction with IoU higher than 0.5, and prediction with IoU lower than 0.5, respectively.
Fig.2 Visualization of the '000152.png' image's localization results, the depth predictor's depth maps, and the text-guided adapter's attention score maps for our Mono3DVG-TR.
Fig.3 The gray block is the traditional query without specific geometry information.
Fig.4 The gray block is the traditional query without specific geometry information.
Fig.5 The gray block is the traditional query without specific geometry information.
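The IoU threshold in Fig.1 corresponds to the standard Acc@0.5 criterion. As a simplified illustration only (the actual evaluation uses rotated 3D boxes; this sketch assumes axis-aligned boxes given as `(x1, y1, z1, x2, y2, z2)`), the pass/fail decision could be computed like this:

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes (x1, y1, z1, x2, y2, z2).

    NOTE: a simplified stand-in; real 3D grounding metrics handle box rotation.
    """
    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    # Overlap length along each of the three axes (clamped at zero).
    inter_dims = [max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i])) for i in range(3)]
    inter = inter_dims[0] * inter_dims[1] * inter_dims[2]
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0

gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)    # ground-truth box
pred = (0.0, 0.0, 0.0, 2.0, 2.0, 4.0)  # prediction, twice as deep
correct = iou_3d_axis_aligned(gt, pred) >= 0.5  # IoU = 8 / 16 = 0.5, so True
```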
@inproceedings{zhan2024mono3dvg,
title={Mono3DVG: 3D Visual Grounding in Monocular Images},
author={Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={7},
pages={6988--6996},
year={2024}
}
Our code is based on MonoDETR (ICCV 2023). We sincerely appreciate the authors' contributions and thank them for releasing their source code. I would like to thank Xiong Zhitong and Yuan Yuan for helping with the manuscript. I also thank the School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University for supporting this work.
If you have any questions about this project, please feel free to contact zhanyangnwpu@gmail.com.