TencentARC / ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations

TL;DR: We present ViT-Lens, an approach for advancing omni-modal representation learning by leveraging a pretrained ViT with modality Lenses to comprehend diverse modalities.



🔨 Installation

conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens

# Install pytorch>=1.9.0 
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt

Training/Inference on OpenShape Triplets on 3D point clouds: environment setup

```shell
conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y
conda install -c dglteam/label/cu113 dgl -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt
```
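
After installation, a quick check that the pinned packages actually landed in the active environment can save debugging time later. The sketch below is not part of the ViT-Lens repository: `EXPECTED`, `installed_version` and `check_environment` are hypothetical names, and the script assumes the packages expose standard metadata that `importlib.metadata` can read. The pins shown match the default environment; for the OpenShape environment, swap in 1.12.1 / 0.13.1 / 0.12.1.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions pinned by the default install command in this README (assumed target).
EXPECTED = {
    "torch": "1.11.0",
    "torchvision": "0.12.0",
    "torchaudio": "0.11.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected: Dict[str, str]) -> Dict[str, str]:
    """Map each package to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found.split("+")[0] == wanted:
            # Local build tags such as '+cu113' are stripped before comparing.
            report[package] = "ok"
        else:
            report[package] = f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for package, status in check_environment(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `vit-lens` environment; any line that does not end in `ok` points to a package worth reinstalling.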

🔍 ViT-Lens Model

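| Model | MN40 | SUN.D | NYU.D | Audioset | VGGSound | ESC50 | Clotho | AudioCaps | TAG.M | IN.EEG | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageBind(Huge) | - | 35.1 | 54.0 | 17.6 | 27.8 | 66.9 | 6.0/28.4 | 9.3/42.3 | - | - | - |
| ViT-Lens-L | 80.6 | 52.2 | 68.5 | 26.7 | 31.7 | 75.9 | 8.1/31.2 | 14.4/54.9 | 65.8 | 42.7 | vitlensL |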

We release a one-stop ViT-Lens-L model (based on a Large ViT) and report its performance on ModelNet40 (MN40, top-1 accuracy), SUN RGBD Depth-only (SUN.D, top-1 accuracy), NYUv2 Depth-only (NYU.D, top-1 accuracy), Audioset (Audioset, mAP), VGGSound (VGGSound, top-1 accuracy), ESC50 (ESC50, top-1 accuracy), Clotho (Clotho, R@1/R@10), AudioCaps (AudioCaps, R@1/R@10), Touch-and-Go Material (TAG.M, top-1 accuracy) and ImageNet EEG (IN.EEG, top-1 accuracy). ViT-Lens-L outperforms ImageBind (Huge) on every benchmark for which ImageBind results are listed.
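
For readers unfamiliar with the metrics named above, here is a minimal sketch of how top-1 accuracy and R@K are typically computed from a score matrix. It is an illustration only, not the evaluation code used in this repository: the function names are hypothetical, and `recall_at_k` assumes exactly one matching candidate per query (query `i` matches candidate `i`), which simplifies captioning benchmarks where several captions may describe one clip. mAP is not covered here.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching candidate ranks in the top k.

    Assumes query i matches candidate i (one ground-truth item per query).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


if __name__ == "__main__":
    # Toy classification scores: 3 samples, 3 classes.
    scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.4, 0.3]]
    labels = [1, 0, 2]
    print(f"top-1: {top1_accuracy(scores, labels):.1f}")  # top-1: 66.7

    # Toy query-to-candidate similarities: 3 queries, 3 candidates.
    similarity = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]]
    print(f"R@1: {recall_at_k(similarity, 1):.1f}")  # R@1: 66.7
    print(f"R@2: {recall_at_k(similarity, 2):.1f}")  # R@2: 100.0
```

An entry such as `8.1/31.2` in the table reads as R@1 = 8.1 and R@10 = 31.2 on this percentage scale.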

For more model checkpoints (trained on different data or with better performance), please refer to MODEL_ZOO.md.

📚 Usage

📦 Datasets

Please refer to DATASETS.md for dataset preparation.

🚀 Training & Inference

Please refer to TRAIN_INFERENCE.md for details.

🧩 Model Zoo

Please refer to MODEL_ZOO.md for details.

👀 Visualization of Demo

[ Plug ViT-Lens into SEED: Video Demo ]
[ Plug ViT-Lens into SEED: enabling compound Any-to-Image Generation ]
[ Plug ViT-Lens into InstructBLIP: Video Demo ]
[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following ]
[ Example: Plug 3D lens to LLM ] (plant, piano)

🎓 Citation

If you find our work helpful, please give us a star 🌟 and consider citing:

    @InProceedings{Lei_2024_CVPR,
        author    = {Lei, Weixian and Ge, Yixiao and Yi, Kun and Zhang, Jianfeng and Gao, Difei and Sun, Dylan and Ge, Yuying and Shan, Ying and Shou, Mike Zheng},
        title     = {ViT-Lens: Towards Omni-modal Representations},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2024},
        pages     = {26647-26657}
    }

✉️ Contact

Questions and discussions are welcome: email leiwx52@gmail.com or open an issue.

🙏 Acknowledgement

This codebase is based on open_clip, ULIP, OpenShape and LAVIS. Big thanks to the authors for their awesome contributions!