We present Robin3D, a state-of-the-art 3D Large Language Model trained on large-scale instruction-following data generated by our novel Robust Instruction Generation (RIG) data engine. To handle the complex data produced by RIG, Robin3D further strengthens spatial understanding with a Relation-Augmented Projector and improves object referring and grounding with ID-Feature Bonding.
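As a toy illustration of the ID-Feature Bonding idea, one can imagine wrapping each object's feature tokens with its ID token so the language model can tie the identifier to the right feature span. The function name, token layout, and string tokens below are illustrative assumptions, not Robin3D's actual implementation:

```python
# Toy sketch of ID-Feature Bonding (illustrative only; names and the
# token layout are assumptions, not the paper's code).

def bond_id_features(object_features):
    """Interleave per-object ID tokens with their feature tokens.

    object_features: list of per-object feature-token lists.
    Returns a flat token sequence where each object's features are
    bracketed by its ID token on both sides.
    """
    sequence = []
    for obj_id, feats in enumerate(object_features):
        id_token = f"<OBJ{obj_id}>"
        # Bond the ID to the features by placing the ID token before
        # and after the feature span.
        sequence += [id_token, *feats, id_token]
    return sequence

print(bond_id_features([["f0a", "f0b"], ["f1a"]]))
# -> ['<OBJ0>', 'f0a', 'f0b', '<OBJ0>', '<OBJ1>', 'f1a', '<OBJ1>']
```

Bracketing the features on both sides (rather than only prefixing) keeps the identifier adjacent to the span from either direction in the token stream.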
[2024.09] We release Robin3D [paper][code], a new SOTA 3D LLM for 3D scenes.
Prepare the environment:

```shell
conda create -n robin3d python=3.9.17
conda activate robin3d
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
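After installation, a quick sanity check that the core packages are importable can save debugging time later. This is a generic helper, not part of the Robin3D codebase:

```python
# Generic post-install sanity check (not part of the Robin3D repo).
import importlib.util


def check_env(required=("torch", "torchvision", "torchaudio")):
    """Return the subset of required packages that are importable
    in the current environment, without actually importing them."""
    return [m for m in required if importlib.util.find_spec(m) is not None]


# Prints the installed subset of the PyTorch packages listed above;
# all three should appear if the conda install succeeded.
print(check_env())
```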
Download the LLM backbone:
Annotations and extracted features:
Please follow the instructions in Chat-Scene's Preparation.
Stay tuned for our project. 🔥
If you have any questions or suggestions, feel free to drop us an email (wkang11@hawk.iit.edu) or open an issue.
Thanks to the following open-source projects:
3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer, Grounded-3DLLM, Chat-Scene
Detectors: Mask3D
Representations: Uni3D, DINOv2
3D Models: OpenScene