Point-Bind & Point-LLM: Aligning 3D with Multi-modality

Official implementation of 'Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following'.

News

Point-Bind

With a joint embedding space between 3D and multiple modalities, Point-Bind enables four promising applications:


Point-LLM

Using Point-Bind, we introduce Point-LLM, the first 3D LLM that responds to instructions conditioned on 3D point clouds, supporting both English and Chinese. Point-LLM exhibits two main characteristics:


The overall pipeline of Point-LLM is as follows: we efficiently fine-tune LLaMA 7B for 3D instruction following, building on the parameter-efficient designs of LLaMA-Adapter and ImageBind-LLM.
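
As a rough illustration of this conditioning scheme, the sketch below gates a projected 3D feature onto learnable adapter prompts, in the spirit of LLaMA-Adapter's zero-initialized attention. All names, shapes, and the injection point are assumptions for exposition, not the repository's actual code.

import torch
import torch.nn as nn

# Illustrative sketch only: module names and dimensions are assumptions,
# not Point-LLM's API.
class PointPromptInjector(nn.Module):
    def __init__(self, point_dim=1024, llama_dim=4096):
        super().__init__()
        # Project Point-Bind's global 3D feature into LLaMA's hidden space.
        self.proj = nn.Linear(point_dim, llama_dim)
        # Zero-initialized gate: fine-tuning starts from the frozen LLM's behavior.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, prompts, point_feat):
        # prompts:    (batch, prompt_len, llama_dim) learnable adapter prompts
        # point_feat: (batch, point_dim) global feature from the Point-Bind encoder
        visual = self.proj(point_feat).unsqueeze(1)  # (batch, 1, llama_dim)
        # Broadcast the gated 3D condition onto every prompt token.
        return prompts + self.gate * visual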


Getting Started

Please refer to Install.md for preparing environments and pre-trained checkpoints.

3D with Multi-modalities

We provide simple inference scripts to verify Point-Bind's embedding alignment between 3D and other modalities.

Compare 3D with Text

Run python demo_text_3d.py with input:

text_list = ['An airplane', 'A car', 'A toilet']
point_paths = ["examples/airplane.pt", "examples/car.pt", "examples/toilet.pt"]

The output similarity matrix:

Text x Point Cloud
tensor([[1.0000e+00, 6.5731e-09, 6.5958e-10],
        [1.7373e-06, 9.9998e-01, 1.7816e-05],
        [2.1133e-10, 3.4070e-08, 1.0000e+00]])
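
For intuition, each matrix is a row-wise softmax over scaled cosine similarities between embeddings from the joint space. Below is a minimal sketch, assuming the embeddings have already been produced by Point-Bind's text and point-cloud encoders; the temperature value is illustrative, not the checkpoints' actual setting.

import torch
import torch.nn.functional as F

def similarity_matrix(emb_a, emb_b, temperature=0.01):
    # emb_a: (num_a, dim), emb_b: (num_b, dim) embeddings from the joint space.
    emb_a = F.normalize(emb_a, dim=-1)   # unit norm, so dot product = cosine
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature
    return logits.softmax(dim=-1)        # each row sums to 1, as in the output above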

Compare 3D with Audio

Run python demo_audio_3d.py with input:

audio_paths = ["examples/airplane_audio.wav", "examples/car_audio.wav", "examples/toilet_audio.wav"]
point_paths = ["examples/airplane.pt", "examples/car.pt", "examples/toilet.pt"]

The output similarity matrix:

Audio x Point Cloud: 
tensor([[0.9907, 0.0041, 0.0051],
        [0.0269, 0.9477, 0.0254],
        [0.0057, 0.0170, 0.9773]])
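
Because all modalities share one embedding space, the same routine applies unchanged: with audio embeddings in place of text embeddings, the hypothetical helper sketched above produces a matrix of exactly this shape.

# Reusing the illustrative similarity_matrix from the text example above.
probs = similarity_matrix(audio_emb, point_emb)  # (num_audios, num_point_clouds)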

3D Zero-shot Tasks

For 3D zero-shot classification, please follow DATASET.md to download ModelNet40 and put it under data/modelnet40_normal_resampled/. Then run bash scripts/pointbind_i2pmae.sh or bash scripts/pointbind_pointbert.sh to evaluate Point-Bind with the I2P-MAE or Point-BERT encoder, respectively.
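
Conceptually, zero-shot classification matches each test shape against text embeddings of the 40 class names and predicts the argmax. The sketch below is a hedged outline of that matching step; the prompt template and encoder calls are assumed rather than taken from the scripts.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(point_emb, class_text_emb):
    # point_emb:      (num_shapes, dim) Point-Bind embeddings of the test shapes
    # class_text_emb: (40, dim) embeddings of prompts like 'a 3D model of an airplane'
    point_emb = F.normalize(point_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    logits = point_emb @ class_text_emb.t()  # cosine similarity to each class
    return logits.argmax(dim=-1)             # predicted class index per shape

# Accuracy is the fraction of shapes whose prediction matches the label:
# acc = (zero_shot_classify(point_emb, text_emb) == labels).float().mean()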

Zero-shot classification accuracy comparison:

Model          Encoder      ModelNet40 (%)
PointCLIP      2D CLIP      20.2
ULIP           Point-BERT   60.4
PointCLIP V2   2D CLIP      64.2
ULIP 2         Point-BERT   66.4
Point-Bind     Point-BERT   76.3
Point-Bind     I2P-MAE      78.0

Inference & Demo for Point-LLM

Setup

Inference

Demo

Try out our web demo, which supports multiple modalities, including 3D point clouds, powered by ImageBind-LLM.

Contributors

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Peng Gao

Related Work

Other excellent works that combine 3D point clouds with LLMs:

Contact

If you have any questions about this project, please feel free to contact zhangrenrui@pjlab.org.cn and zyguo@cse.cuhk.edu.hk.