Haiyang-W / GiT

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
Apache License 2.0

The first GPT-style general vision model that unifies various vision tasks using only a vanilla ViT, with no negative transfer.

[![arXiv](https://img.shields.io/badge/Arxiv-2403.09394-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.09394) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Haiyang-W/GiT/blob/main/LICENSE) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FHaiyang-W%2FGiT%2Ftree%2Fmain&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com) [![GitHub issues](https://img.shields.io/github/issues/Haiyang-W/GiT?color=critical&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aopen+is%3Aissue) [![GitHub closed issues](https://img.shields.io/github/issues-closed/Haiyang-W/GiT?color=success&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aissue+is%3Aclosed) [![Twitter](https://img.shields.io/badge/Twitter-🔥%2036k%20views-b31b1b.svg?style=social&logo=twitter)](https://twitter.com/_akhaliq/status/1768484390873477480)

This repo is the official implementation of the ECCV 2024 Oral paper "GiT: Towards Generalist Vision Transformer through Universal Language Interface", as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$

  • Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )

📣 News

💫 What we want to do

Model architectures across various AI domains are converging toward multi-layer plain Transformers.

🤔 What we achieve

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:


🚀 Main Results

Single-Task Benchmark

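| Model | Params | Metric | Performance | ckpt | log | config |
|---|---|---|---|---|---|---|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |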

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---|---|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |

<!-- GiT-Bsingle-task 131M 45.1 31.4 47.7 33.7 83.3 ckpt log config -->

Task Synergy in Multi-Tasking Training

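| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---|---|---|---|---|---|---|
| GiT-B<sub>single-task</sub> | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |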
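
The Improvement row is the per-task difference between the multi-task and single-task GiT-B scores. A minimal sketch that recomputes it from the numbers in the table; the variable and function names are illustrative and not part of the GiT codebase:

```python
# Scores copied from the table above (GiT-B, 131M parameters).
TASKS = ["Detection", "Ins Seg", "Sem Seg", "Caption", "Grounding"]
single_task = [45.1, 31.4, 47.7, 33.7, 83.3]
multi_task = [46.7, 31.9, 47.8, 35.3, 85.8]


def improvements(single, multi):
    """Per-task gain of multi-task over single-task training, rounded to 0.1."""
    return [round(m - s, 1) for s, m in zip(single, multi)]


gains = improvements(single_task, multi_task)
for task, gain in zip(TASKS, gains):
    print(f"{task}: {gain:+.1f}")
```

Every gain is positive, which is what the "no negative transfer" claim refers to for these five tasks.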

Zero-shot benchmark

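| Model | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|---|---|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B<sub>universal</sub> | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L<sub>universal</sub> | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H<sub>universal</sub> | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |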

Few-shot benchmark

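| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B<sub>universal</sub> | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L<sub>universal</sub> | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H<sub>universal</sub> | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |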

🛠️ Quick Start


conda create -n GiT python=3.8

conda activate GiT

# We have only tested with PyTorch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# If you encounter ChildFailedError, install yapf 0.40.1
pip install yapf==0.40.1
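
After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the Python standard library; the helper names `installed_versions` and `mismatches` are hypothetical and not part of GiT, and the pins are simply copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {
    "torch": "1.9.1",
    "torchvision": "0.10.1",
    "torchaudio": "0.9.1",
    "mmengine": "0.8.3",
    "mmcv": "2.0.1",
    "transformers": "4.31.0",
}


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, pinned):
    """List (package, expected, actual) for every package that deviates from its pin."""
    bad = []
    for name, expected in pinned.items():
        actual = installed.get(name)
        # Strip a local version suffix such as "+cu111" before comparing.
        base = actual.split("+")[0] if actual else None
        if base != expected:
            bad.append((name, expected, actual))
    return bad


if __name__ == "__main__":
    for name, expected, actual in mismatches(installed_versions(PINNED), PINNED):
        print(f"{name}: expected {expected}, found {actual}")
```

Run inside the activated `GiT` conda environment, the script prints nothing when every pinned package is present at the expected version.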

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark uses COCO 2017 for object detection and instance segmentation, ADE20K for semantic segmentation, COCO Caption for image captioning, and the RefCOCO series for visual grounding.

|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p
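
Before launching training, the layout above can be checked programmatically. This is a minimal sketch, not a GiT utility: `missing_dirs` is a hypothetical helper, and the dataset root passed to it (for example a `data` folder inside the repository) is an assumption, since the tree above does not show its parent directory.

```python
from pathlib import Path

# Directories taken from the tree above, relative to the dataset root.
EXPECTED_DIRS = [
    "ade/ADEChallengeData2016/annotations",
    "ade/ADEChallengeData2016/images",
    "coco/annotations",
    "coco/train2017",
    "coco/val2017",
    "coco_2014/annotations",
    "coco_2014/train2014",
    "coco_2014/val2014",
    "coco_2014/refcoco",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "data" is an assumed location; point this at your own dataset root.
    for rel in missing_dirs("data"):
        print(f"missing: {rel}")
```

An empty result means every directory in the list exists; it does not verify the files inside them.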

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.

🚨 We only list part of the commands (GiT-B) below. For more detailed commands, please refer to here.


Training

Single Task


bash tools/dist_train.sh configs/GiT/single_detection_base.py  ${GPU_NUM} --work-dir ${work_dir}

Multi Task


bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py  ${GPU_NUM} --work-dir ${work_dir}

Universal Training


bash tools/dist_train.sh configs/GiT/universal_base.py  ${GPU_NUM} --work-dir ${work_dir}


Testing

Single Task


bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Multi Task


bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
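
The train and test scripts take their positional arguments in a different order: `dist_test.sh` expects the checkpoint between the config and the GPU count. A minimal sketch that assembles both command lines from Python, following only the argument order shown above; the function names, the GPU count of 8, and the `work_dirs/...` path are illustrative assumptions:

```python
import shlex


def train_command(config, gpu_num, work_dir):
    """Argument list for tools/dist_train.sh, in the order shown above."""
    return ["bash", "tools/dist_train.sh", config, str(gpu_num),
            "--work-dir", work_dir]


def eval_command(config, ckpt_file, gpu_num, work_dir):
    """Argument list for tools/dist_test.sh; the checkpoint precedes the GPU count."""
    return ["bash", "tools/dist_test.sh", config, ckpt_file, str(gpu_num),
            "--work-dir", work_dir]


# Placeholder values: 8 GPUs and a work directory chosen for illustration.
cmd = train_command("configs/GiT/single_detection_base.py", 8,
                    "work_dirs/single_detection")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell-quoting problems when paths contain spaces.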

Zero-shot and few-shot

Please download the universal pretrained weights from Hugging Face and organize the files as follows:



bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}


bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}

Customize Dataset

If you want to use GiT on your own dataset, please refer to here for more details.

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the resolution as done here, where we lower the detection resolution to 672. This setting requires ~20 hours of training and reaches ~41.5 mAP.

👀 Todo

👍 Acknowledgement

📘 Citation

Please consider citing our work as follows if it is helpful.

    @article{wang2024git,
        title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
        author={Haiyang Wang and Hao Tang and Li Jiang and Shaoshuai Shi and Muhammad Ferjad Naeem and Hongsheng Li and Bernt Schiele and Liwei Wang},
        journal={arXiv preprint arXiv:2403.09394},
        year={2024}
    }
