ECCV 2024
Chenming Zhu
Tai Wang
Wenwei Zhang
Kai Chen
Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory
ScanReason is the first comprehensive and hierarchical 3D reasoning grounding benchmark. We define five question types according to the kind of reasoning required. Spatial reasoning and function reasoning require a fundamental understanding of the 3D physical world: spatial reasoning focuses on inter-object spatial relationships in a 3D scene, and function reasoning focuses on the objects themselves. Logical reasoning, emotional reasoning, and safety reasoning are high-level skills built on these two fundamental abilities to address user-centric real-world applications.
1. Installation
We use at least 4 A100 GPUs for training and inference.
We test the code under the following environment:
Clone our repository and create a conda environment:
git clone https://github.com/ZCMax/ScanReason.git
conda create -n scanreason python=3.9
conda activate scanreason
pip install -r requirements.txt
Follow the EmbodiedScan Installation Doc to install the embodiedscan series.
Compile Pointnet2:
cd pointnet2
python setup.py install --user
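Before moving on, it may be worth confirming that the machine exposes the 4 GPUs mentioned at the start of this section. The sketch below is a hypothetical helper, not part of this repository: the function names `gpu_count` and `has_enough_gpus` are made up for illustration, and it assumes `nvidia-smi` is the tool used to list devices. It only counts GPUs and does not verify that they are A100s.

```shell
# Hypothetical helper (not shipped with ScanReason).
# Prints the number of GPUs reported by nvidia-smi, or 0 if the tool is
# missing or fails.
gpu_count() {
  local out
  if command -v nvidia-smi >/dev/null 2>&1 \
     && out=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null); then
    # One non-empty line per GPU; "|| true" keeps a zero count from
    # being treated as an error.
    printf '%s\n' "$out" | grep -c . || true
  else
    echo 0
  fi
}

# Succeeds only if at least $1 GPUs (default: 4) are visible.
has_enough_gpus() {
  [ "$(gpu_count)" -ge "${1:-4}" ]
}
```

A possible use is `has_enough_gpus 4 || echo "fewer than 4 GPUs visible"` before launching any of the scripts.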
2. Data Preparation
Follow the EmbodiedScan Data Preparation Doc to download the raw scan (RGB-D) datasets, then set VIDEO_FOLDER in train_ds.sh to the raw data path.
Download the text annotations from Google Drive and set JSON_FOLDER in train_ds.sh to the annotations path. Also set INFO_FILE to the path of the info file, which is included in the annotations.
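Since training depends on three locations being set correctly, a quick check before launching can catch a mistyped path early. The following is a minimal sketch, assuming VIDEO_FOLDER and JSON_FOLDER are directories and INFO_FILE is a regular file; `check_data_paths` is a hypothetical helper and is not provided by this repository.

```shell
# Hypothetical helper (not shipped with ScanReason).
# Arguments: the values you put into train_ds.sh for
#   $1 = VIDEO_FOLDER (assumed to be a directory)
#   $2 = JSON_FOLDER  (assumed to be a directory)
#   $3 = INFO_FILE    (assumed to be a regular file)
# Reports every missing path on stderr and returns non-zero if any is missing.
check_data_paths() {
  local video_folder="$1"
  local json_folder="$2"
  local info_file="$3"
  local status=0

  if [ ! -d "$video_folder" ]; then
    echo "missing VIDEO_FOLDER: $video_folder" >&2
    status=1
  fi
  if [ ! -d "$json_folder" ]; then
    echo "missing JSON_FOLDER: $json_folder" >&2
    status=1
  fi
  if [ ! -f "$info_file" ]; then
    echo "missing INFO_FILE: $info_file" >&2
    status=1
  fi
  return "$status"
}
```

For example, `check_data_paths "$VIDEO_FOLDER" "$JSON_FOLDER" "$INFO_FILE" && ./scripts/train_ds.sh` would start training only when all three paths exist.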
3. Training ReGround3D
We provide a Slurm training script configured for 4 A100 GPUs:
./scripts/train_ds.sh
4. Evaluating ReGround3D
After training, run ./scripts/convert_zero_to_fp32.sh to convert the weights to a pytorch_model.bin file, then run ./scripts/merge_lora_weights.sh to merge the LoRA weights and obtain the final checkpoint under ReGround3D-7B.
Finally, run ./scripts/eval_ds.sh to obtain the grounding results.
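The three evaluation steps must run in order, and a failed conversion or merge makes the later steps pointless. The sketch below chains them and stops at the first failure. It is a hypothetical wrapper, not part of this repository: `run_eval_pipeline` is an illustrative name, and it assumes each script already has its paths configured, takes no arguments, and can be run with `bash`.

```shell
# Hypothetical wrapper (not shipped with ScanReason).
# $1 = directory containing the three scripts (default: ./scripts).
# Runs conversion, LoRA merge, and evaluation in that order and stops
# as soon as a script is missing or exits with a non-zero status.
run_eval_pipeline() {
  local scripts_dir="${1:-./scripts}"
  local step
  for step in convert_zero_to_fp32.sh merge_lora_weights.sh eval_ds.sh; do
    if [ ! -f "$scripts_dir/$step" ]; then
      echo "missing script: $scripts_dir/$step" >&2
      return 1
    fi
    echo "running $step"
    if ! bash "$scripts_dir/$step"; then
      echo "$step failed; stopping" >&2
      return 1
    fi
  done
}
```

Calling `run_eval_pipeline` from the repository root would use `./scripts`; passing another directory as the first argument overrides that default.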
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This repo benefits from LISA, EmbodiedScan, 3D-LLM, and LLaVA.