Chenming Zhu
Tai Wang*
Wenwei Zhang
Jiangmiao Pang
Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory
LLaVA-3D can perform both 2D and 3D vision-language tasks. The left block (b) shows that LLaVA-3D achieves state-of-the-art performance across a wide range of 3D benchmarks compared with previous 3D LMMs, while maintaining performance comparable to LLaVA-1.5 on various 2D benchmarks. The middle block (c) shows that LLaVA-3D is built on the 2D LMM LLaVA and uses 3D patches to give it 3D spatial awareness, enabling it to perform various 3D vision-and-language tasks in the physical world. The right blocks (d) and (e) highlight the significantly faster convergence and inference speeds of LLaVA-3D compared with existing 3D LMMs.
LLaVA-3D Architecture. Building on LLaVA, we add the corresponding 3D position embeddings directly to the 2D patch visual tokens of multi-view images to construct 3D patches. The 3D patches then undergo 3D pooling and are passed to LLaVA's projection layer, which maps them into the LLM space. The model is aligned with the LLM using 3D vision-language data.
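To make the data flow in the architecture description concrete, here is a minimal pure-Python sketch of one way it could look. It is not the repository's implementation. The pinhole back-projection, the sinusoidal `position_embedding`, and the voxel-average `voxel_pool` are all assumptions chosen for illustration, since the description above does not specify the embedding form or the pooling strategy. All function names are hypothetical.

```python
import math
from collections import defaultdict


def backproject(u, v, depth, intrinsics, cam_to_world):
    """Lift pixel (u, v) with metric depth to a world-space 3D point."""
    fx, fy, cx, cy = intrinsics
    # Pinhole model: pixel -> homogeneous camera coordinates.
    cam = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0)
    # Apply the 4x4 camera-to-world pose (row-major nested lists).
    return tuple(sum(cam_to_world[r][c] * cam[c] for c in range(4)) for r in range(3))


def position_embedding(xyz, dim):
    """Assumed sinusoidal embedding of a 3D point (illustrative only)."""
    emb = []
    for i in range(dim):
        axis, band = i % 3, i // 3
        freq = 2.0 ** (band // 2)
        fn = math.sin if band % 2 == 0 else math.cos
        emb.append(fn(xyz[axis] * freq))
    return emb


def make_3d_patches(tokens, points):
    """Add each token's 3D position embedding to its 2D patch feature."""
    return [
        [f + p for f, p in zip(tok, position_embedding(xyz, len(tok)))]
        for tok, xyz in zip(tokens, points)
    ]


def voxel_pool(patches, points, voxel_size):
    """Average all 3D patches whose points fall into the same voxel."""
    buckets = defaultdict(list)
    for patch, xyz in zip(patches, points):
        key = tuple(math.floor(c / voxel_size) for c in xyz)
        buckets[key].append(patch)
    return {
        key: [sum(col) / len(group) for col in zip(*group)]
        for key, group in buckets.items()
    }


# Toy example: three patch centres seen by one camera at the world origin.
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (made-up values)
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
pixels = [(320, 240, 2.0), (325, 240, 2.0), (620, 240, 2.0)]  # (u, v, depth)
points = [backproject(u, v, d, intrinsics, identity) for u, v, d in pixels]
tokens = [[0.0] * 6, [1.0] * 6, [2.0] * 6]  # stand-ins for 2D patch features
pooled = voxel_pool(make_3d_patches(tokens, points), points, voxel_size=0.5)
print(len(tokens), "->", len(pooled))  # prints: 3 -> 2
```

The first two pixels land in the same 0.5 m voxel and are averaged into one token, which illustrates how pooling in 3D space can reduce the number of tokens sent to the projection layer.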
We test our code under the environment set up by the commands below (Python 3.10, PyTorch 2.1.0, CUDA 11.8 wheels).
To start:
git clone https://github.com/ZCMax/LLaVA-3D.git
cd LLaVA-3D
conda create -n llava-3d python=3.10 -y
conda activate llava-3d
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -e .
Download the Camera Parameters File and put the JSON file under ./playground/data/annotations.
Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
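After installation, it can be useful to confirm that the pinned PyTorch versions from the commands above are the ones actually installed. The following is a small sketch using only the standard library. The helper name `check_pins` is hypothetical, and the assumption that wheels from the cu118 index report a local suffix such as `+cu118` reflects typical behaviour rather than anything stated in this README.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip install command above.
EXPECTED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def check_pins(expected):
    """Return {package: status}; status is 'ok', 'missing', or 'mismatch: <found>'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Strip a local version suffix (e.g. "+cu118") before comparing.
        base = found.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch: {found}"
    return report


if __name__ == "__main__":
    for name, status in check_pins(EXPECTED).items():
        print(f"{name}: {status}")
```

Run it inside the activated llava-3d environment. Any line that is not "ok" suggests the environment differs from the one the commands above set up.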
The trained model checkpoints are available here. Currently we only provide the 7B model, and we will continue to update the model zoo.
We currently support a single image as input for 2D tasks and posed RGB-D images as input for 3D tasks. You can run the demo with the script llava/eval/run_llava_3d.py. For 2D tasks, use the --image-file parameter; for 3D tasks, use the --video-path parameter to provide the corresponding data. Here are some demos as examples:
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--image-file https://llava-vl.github.io/static/images/view.jpg \
--query "What are the things I should be cautious about when I visit here?"
We provide the demo scene here. Download the demo data and put it under ./demo.
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0356_00 \
--query "Tell me the only object that I could see from the other room and describe the object."
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0566_00 \
--query "The related object is located at [0.981, 1.606, 0.430]. Describe the object in detail."
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0382_01 \
--query "The related object is located at [-0.085,1.598,1.310]. Please output the 3D bounding box of the object and then describe the object."
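The demo invocations above share one pattern: the same script and model, one input flag (--image-file for 2D, --video-path for 3D), and a query. For batch runs, that pattern could be wrapped in a small helper like the sketch below. The helper names `build_demo_command` and `location_query` are hypothetical, and the "exactly one input" check is an assumption of this sketch rather than documented behaviour of the script. Only the flags shown in the commands above are used.

```python
import shlex

SCRIPT = "llava/eval/run_llava_3d.py"
MODEL = "ChaimZhu/LLaVA-3D-7B"


def build_demo_command(query, image_file=None, video_path=None, model_path=MODEL):
    """Build the argv list for the demo script from a query and one input."""
    # Assumption of this sketch: pass either a 2D image or a 3D scene, not both.
    if (image_file is None) == (video_path is None):
        raise ValueError("pass exactly one of image_file (2D) or video_path (3D)")
    argv = ["python", SCRIPT, "--model-path", model_path]
    if image_file is not None:
        argv += ["--image-file", image_file]
    else:
        argv += ["--video-path", video_path]
    return argv + ["--query", query]


def location_query(xyz, instruction):
    """Format a query that starts with a 3D location, as in the demos above."""
    coords = ", ".join(f"{c:.3f}" for c in xyz)
    return f"The related object is located at [{coords}]. {instruction}"


cmd = build_demo_command(
    location_query((0.981, 1.606, 0.430), "Describe the object in detail."),
    video_path="./demo/scannet/scene0566_00",
)
print(shlex.join(cmd))  # shell-quoted command line, ready to paste
```

To execute the command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.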
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
@article{zhu2024llava,
title={LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness},
author={Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui},
journal={arXiv preprint arXiv:2409.18125},
year={2024}
}
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.