We build a multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.
[2024.09] 🔥 Chat-Scene has been accepted by NeurIPS 2024! [paper]
[2024.08] 🔥 We release Chat-Scene, capable of processing both 3D point clouds and 2D multi-view images for improved 3D scene understanding, leading to significant advancements in grounding and captioning performance.
[2024.04] We release a refined implementation (v2.1), which achieves better performance on grounding, captioning, and QA tasks. The code is available in branch v2.1.
[2023.12] We release Chat-3D v2 [paper], introducing object identifiers for enhanced object referencing and grounding in 3D scenes. The original code is available in branch v2.0.
[2023.08] We release Chat-3D [paper] [code], an LLM-based dialogue system for 3D scenes.
Performance Comparison
| Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| v2.0 | 35.9 | 30.4 | - | - | 28.1 | 15.5 | 77.1 | 7.3 | - |
| v2.1 | 42.5 | 38.4 | 45.1 | 41.6 | 63.9 | 31.8 | 87.6 | 14.0 | 54.7 |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |
*The v2.1 and Chat-Scene results are based on single models without task-specific finetuning.
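For readers unfamiliar with the grounding columns: Acc@0.25 and Acc@0.5 count a predicted box as correct when its IoU with the ground-truth box reaches the given threshold. The snippet below is a minimal sketch of that metric, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The function names `box_iou_3d` and `acc_at_iou` are illustrative and are not taken from this repository's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy example with three prediction / ground-truth pairs.
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1), (0, 0, 0, 3, 1, 1)]
gts = [(0, 0, 0, 2, 2, 1), (5, 5, 5, 6, 6, 6), (0, 0, 0, 1, 1, 1)]
print(acc_at_iou(preds, gts, 0.25), acc_at_iou(preds, gts, 0.5))
```

The table reports these fractions as percentages.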
Prepare the environment:
conda create -n chat-scene python=3.9.17
conda activate chat-scene
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
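After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a small sketch using the standard library's `importlib.metadata`. It assumes the packages are registered under the distribution names `torch`, `torchvision`, and `torchaudio`; the helper name `find_mismatches` is hypothetical and not part of this repository.

```python
from importlib import metadata
from typing import Callable, Dict, Tuple

# Versions pinned in the conda install command above.
EXPECTED = {"torch": "2.2.1", "torchvision": "0.17.1", "torchaudio": "2.2.1"}


def find_mismatches(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Tuple[str, str]]:
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = "not installed"
        # Ignore local build suffixes such as "+cu118" when comparing.
        if found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

No output means all three packages match the pinned versions.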
Download LLM backbone:
We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
Change the llama_model_path in config.py to the path of vicuna-7b-v1.5.
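Before launching a run, a quick sanity check of the downloaded folder can catch a wrong path early. The sketch below assumes the checkpoint is stored in the usual Hugging Face layout, with a `config.json` whose `model_type` is `llama` (Vicuna v1.5 is LLaMA-based). The helper `check_llm_path` is hypothetical and does not reflect how this repository loads the model.

```python
import json
from pathlib import Path
from typing import List


def check_llm_path(llama_model_path: str) -> List[str]:
    """Return a list of problems found in a local checkpoint folder (empty if none)."""
    root = Path(llama_model_path)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    config_file = root / "config.json"
    if not config_file.is_file():
        return [f"{config_file} is missing"]
    config = json.loads(config_file.read_text())
    problems = []
    # Assumption: a Vicuna v1.5 checkpoint declares the LLaMA architecture.
    if config.get("model_type") != "llama":
        problems.append(f"unexpected model_type: {config.get('model_type')!r}")
    return problems


# Placeholder path; replace with the value set for llama_model_path.
for problem in check_llm_path("/path/to/vicuna-7b-v1.5"):
    print(problem)
```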
Annotations and extracted features:
Please follow the instructions in preprocess.
Training
Set the following in scripts/run.sh:
train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=False
Then launch training:
bash scripts/run.sh
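The tag strings pack several dataset names into one value. As an illustration, and assuming `#` acts purely as a separator between names, the sketch below splits the two tags and lists the datasets that are used for training only. `parse_tags` is a hypothetical helper, not a function from this repository.

```python
from typing import List


def parse_tags(tag: str) -> List[str]:
    """Split a '#'-separated tag string into dataset names, dropping empty parts."""
    return [name for name in tag.split("#") if name]


train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"

train_sets = parse_tags(train_tag)
val_sets = parse_tags(val_tag)

# Datasets that appear in training but are not evaluated.
train_only = [name for name in train_sets if name not in val_sets]
print(train_only)  # ['nr3d_caption', 'obj_align']
```

Under this reading, training on a subset of tasks would amount to removing names from `train_tag` while keeping the `#` separators between the remaining ones.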
Inference
Set the following in scripts/run.sh:
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
Then run evaluation:
bash scripts/run.sh
If you find this project useful in your research, please consider citing:
@article{huang2023chat,
title={Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers},
author={Huang, Haifeng and Wang, Zehan and Huang, Rongjie and Liu, Luping and Cheng, Xize and Zhao, Yang and Jin, Tao and Zhao, Zhou},
journal={arXiv preprint arXiv:2312.08168},
year={2023}
}
@article{wang2023chat,
title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
journal={arXiv preprint arXiv:2308.08769},
year={2023}
}
Stay tuned for our project. 🔥
If you have any questions or suggestions, feel free to drop us an email (huanghaifeng@zju.edu.cn, wangzehan01@zju.edu.cn) or open an issue.
Thanks to the following open-source projects:
(Multi-modal) LLMs: LLaMA, Vicuna, VideoChat, LEO
3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer
Detectors: PointGroup, Mask3D, DEVA