ZCMax / ScanReason

[ECCV 2024] Empowering 3D Visual Grounding with Reasoning Capabilities
Apache License 2.0

Empowering 3D Visual Grounding with Reasoning Capabilities

ECCV 2024
Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory

[![arXiv](https://img.shields.io/badge/arXiv-2407.01525-blue)](https://arxiv.org/abs/2407.01525) [![](https://img.shields.io/badge/Paper-%F0%9F%93%96-blue)](./assets/ECCV_2024_ScanReason.pdf) [![](https://img.shields.io/badge/Project-%F0%9F%9A%80-blue)](https://zcmax.github.io/projects/ScanReason/)

📦 Benchmark and Model

Benchmark Overview

ScanReason is the first comprehensive and hierarchical 3D reasoning grounding benchmark. We define five types of questions according to the type of reasoning required. Spatial reasoning and function reasoning require a fundamental understanding of the 3D physical world, focusing on inter-object spatial relationships in a 3D scene and on the objects themselves, respectively. Logical reasoning, emotional reasoning, and safety reasoning are high-level reasoning skills built upon these two fundamental abilities to address user-centric real-world applications.

Model Overview

Getting Started

1. Installation

2. Data Preparation

  1. Follow the EmbodiedScan Data Preparation Doc to download the raw scan (RGB-D) datasets, then set VIDEO_FOLDER in train_ds.sh to the raw data path.

  2. Download the text annotations from Google Drive, then set JSON_FOLDER in train_ds.sh to the annotations path and set INFO_FILE to the path of the info file included with the annotations.
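
Before launching training, it can help to confirm that the three paths point at real data. The following is a minimal sketch and not part of the repository: `check_data_paths` is a hypothetical helper, and it assumes that VIDEO_FOLDER and JSON_FOLDER are directories while INFO_FILE is a single file, which the steps above suggest but do not state.

```python
from pathlib import Path


def check_data_paths(video_folder, json_folder, info_file):
    """Return a list of human-readable problems; an empty list means all paths look usable.

    Hypothetical helper: the argument names mirror the variables in train_ds.sh,
    but this function is not part of ScanReason.
    """
    problems = []
    # Assumption: the raw scans and the annotations each live in a directory.
    for name, path in (("VIDEO_FOLDER", video_folder), ("JSON_FOLDER", json_folder)):
        if not Path(path).is_dir():
            problems.append(f"{name} is not a directory: {path}")
    # Assumption: INFO_FILE is a single file shipped with the annotations.
    if not Path(info_file).is_file():
        problems.append(f"INFO_FILE is not a file: {info_file}")
    return problems
```

Calling the function with the same values you set in train_ds.sh and printing the returned list shows which variable, if any, still needs attention.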

3. Training ReGround3D

We provide a Slurm training script configured for 4 A100 GPUs:

./scripts/train_ds.sh

4. Evaluating ReGround3D

After training, run

./scripts/convert_zero_to_fp32.sh 

to convert the weights to a pytorch_model.bin file, and then use

./scripts/merge_lora_weights.sh

to merge the LoRA weights and obtain the final checkpoint under ReGround3D-7B.

Finally, run

./scripts/eval_ds.sh

to obtain the grounding results.
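
The three post-training steps depend on each other, so a later script should not run if an earlier one fails. The sketch below chains them under stated assumptions: `run_steps` is a hypothetical wrapper that is not part of the repository, the scripts are assumed to take no arguments and to be launched with `bash` from the repository root, and a cluster that requires submission through Slurm would need a different launcher.

```python
import subprocess

# Script paths as listed in this README, in the order they are meant to run.
EVAL_STEPS = [
    "./scripts/convert_zero_to_fp32.sh",
    "./scripts/merge_lora_weights.sh",
    "./scripts/eval_ds.sh",
]


def run_steps(steps=EVAL_STEPS, run=subprocess.run):
    """Run each script in order and stop at the first failure.

    Hypothetical wrapper, not part of ScanReason. `run` is injectable so the
    sequencing can be exercised without executing the real scripts.
    Returns the list of steps that completed successfully.
    """
    done = []
    for step in steps:
        # check=False: inspect the return code ourselves instead of raising.
        result = run(["bash", step], check=False)
        if result.returncode != 0:
            print(f"{step} failed with exit code {result.returncode}; stopping.")
            break
        done.append(step)
    return done
```

If the returned list is shorter than `EVAL_STEPS`, the first missing entry is the step to investigate.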

📝 TODO List

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

👏 Acknowledgements

This repo benefits from LISA, EmbodiedScan, 3D-LLM, and LLaVA.