SQA3D

This repository hosts the code for the paper:

SQA3D: Situated Question Answering in 3D Scenes (ICLR 2023)

by Xiaojian Ma*, Silong Yong*, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu and Siyuan Huang

🔥Submission guide and leaderboard! | arXiv | slides | Project page

News

10/13/2023: We've released the code for zero-shot models (e.g., LLMs) here.
08/25/2023: New benchmarking guide and leaderboard here.
04/01/2023: We introduce a new localization (situation understanding) task. Please see this for more details.
03/11/2023: We will host a challenge at CVPR 2023 3D Scene Understanding Workshop.
03/10/2023: The data visualization script has been released in the utils folder!
03/01/2023: The ClipBERT pretrained model has been released!
02/13/2023: The MCAN pretrained model has been released!
02/07/2023: The official project page is launched!
01/30/2023: SQA3D data, code and pretrained weights have been released!

Abstract

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal, especially 3D, reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.

Installation

Install PyTorch:

```
conda install pytorch=1.12.0 torchvision torchaudio cudatoolkit=11.3 -c pytorch
```

Install pointnet2; please follow the instructions there.
Install the necessary packages with requirements.txt:
```
pip install -r requirements.txt
```

The code has been tested with Python 3.9, PyTorch 1.12.0 and CUDA 11.3 on Ubuntu 20.04.
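As a quick sanity check after installation, the sketch below compares the installed Python, PyTorch and CUDA versions against the tested ones listed above. It is not part of the repository: the names `TESTED`, `matches` and `collect_versions` are hypothetical helpers introduced here for illustration, and the check only compares leading version components.

```python
import sys

# Versions this README reports as tested (hypothetical helper table).
TESTED = {"python": "3.9", "torch": "1.12.0", "cuda": "11.3"}


def matches(installed, tested):
    """Return True if `installed` starts with the dotted components of `tested`."""
    if installed is None:
        return False
    # Drop local build suffixes such as "+cu113" before comparing.
    parts = str(installed).split("+")[0].split(".")
    want = tested.split(".")
    return parts[: len(want)] == want


def collect_versions():
    """Return installed versions; torch and cuda stay None if PyTorch is missing."""
    found = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch": None,
        "cuda": None,
    }
    try:
        import torch
    except ImportError:
        return found
    found["torch"] = str(torch.__version__)
    found["cuda"] = torch.version.cuda  # None for CPU-only builds
    return found


if __name__ == "__main__":
    for name, installed in collect_versions().items():
        status = "ok" if matches(installed, TESTED[name]) else "differs from tested"
        print(f"{name}: installed={installed} tested={TESTED[name]} ({status})")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the one the authors tested.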

SQA3D data format

Please refer to data format. Note that we only provide the SQA3D annotations. To obtain the scene data (3D scans, egocentric videos or BEV pictures), please refer to Training. SQA3D data is hosted here.

Training

For each model, please refer to ScanQA, MCAN, ClipBERT, or Zero-shot (LLM) for details on how to prepare the scene data and run experiments.

Data Visualization

To visualize the data in SQA3D, run:

```
python utils/visualize_data.py --scene_id <scene_id> --anno_path <anno_path> --ply_path <ply_path>
```

- `<scene_id>` corresponds to the scene you want to visualize; the format should be `scenexxxx_00`.
- `<anno_path>` corresponds to the directory containing the annotation file and should look like `dir/sqa_task`.
- `<ply_path>` corresponds to the directory containing the original ScanNet scans.

## Misc

Please change to the corresponding directory when running experiments with the models. For example, to experiment with MCAN:

```shell
cd MCAN
```

## License

- Code: [Apache](https://github.com/SilongYong/SQA3D/blob/master/LICENSE)
- Data: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

## Citation

If you find our work helpful for your research, please consider citing our paper.

```bibtex
@inproceedings{ma2022sqa3d,
  title={SQA3D: Situated Question Answering in 3D Scenes},
  author={Ma, Xiaojian and Yong, Silong and Zheng, Zilong and Li, Qing and Liang, Yitao and Zhu, Song-Chun and Huang, Siyuan},
  booktitle={International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=IDJx97BC38}
}
```
