
Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"
https://aligngpt-vl.github.io/

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

[Project Page] [Paper] [Demo] [Model]

Authors: Fei Zhao*, Taotian Pang*, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai


Install

Docker

We recommend using Docker to prepare the environment.

  1. Clone this repository and navigate to the AlignGPT folder:
git clone https://github.com/AlignGPT-VL/AlignGPT.git
cd AlignGPT
  2. Build the Docker image:
cd deploy
docker build -t aligngpt:1.0 .

If your machine cannot connect to GitHub to download the flash-attention pip wheel, you can download it manually from https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl and place it at deploy/flash_attn-2.5.5+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl.

  3. To start the container, run the following command from the project root directory:
docker run --gpus all --ipc=host --network=host --rm -it -v .:/workspace aligngpt:1.0

Additional -v options can be added to mount the data and output directories; for example:
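The host paths below are placeholders for illustration, not part of the repository; adjust them to wherever your data and outputs live. The container-side playground/data path matches the data layout described later in this README, while the output mount point is an assumption.

# extra -v mounts for data and outputs; host paths are placeholders
docker run --gpus all --ipc=host --network=host --rm -it \
    -v .:/workspace \
    -v /path/to/host/data:/workspace/playground/data \
    -v /path/to/host/output:/workspace/output \
    aligngpt:1.0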

Conda

  1. Clone this repository and navigate to the AlignGPT folder:
git clone https://github.com/AlignGPT-VL/AlignGPT.git
cd AlignGPT
  2. Install the packages:
conda create -n aligngpt python=3.10 -y
conda activate aligngpt
pip install --upgrade pip  # enable PEP 660 support
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r deploy/requirements.txt

Finally, you need to install flash-attention manually before running the model.
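For example, the prebuilt wheel referenced in the Docker section can be installed directly. This is only a sketch: it assumes CUDA 11.8, PyTorch 2.1.x, and Python 3.10 to match the wheel name, so pick a different release wheel if your setup differs.

# install the prebuilt flash-attention wheel (assumes cu118 / torch 2.1 / cp310)
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl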

Model Zoo

Please download the weights for the LLM and vision backbone and place them in the ./playground/model folder. We also provide all the weights for the AlignGPT checkpoints.

| Model | LLM | Vision Backbone | Pre-training | Instruct-tuning |
|-------|-----|-----------------|--------------|-----------------|
| AlignGPT-7B | Vicuna 7B | CLIP ViT-L/14 | aligngpt-7b-pretrain | aligngpt-7b |
| AlignGPT-13B | Vicuna 13B | CLIP ViT-L/14 | aligngpt-13b-pretrain | aligngpt-13b |
| AlignGPT-LLaMA2 | LLaMA-2-7B-Chat | CLIP ViT-L/14 | To be released | To be released |
| AlignGPT-LLaMA3 | LLaMA-3-8B-Base | CLIP ViT-L/14 | To be released | To be released |

Demo

Start Gradio UI

You can start the Gradio service with the following commands:

cd AlignGPT
bash start_api.sh

This script launches three processes: the controller, the Gradio web server, and the model worker, all of which run in the background. You can view their logs in the log/ folder and check their status with ps -ef | grep src.serve.
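For example, to confirm that all three processes are up and follow their output (the exact log filenames under log/ are not specified here, so the wildcard is an assumption):

ps -ef | grep src.serve   # should show the controller, web server, and model worker
tail -f log/*.log         # follow the background logs; filenames may differ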

CLI Inference

Chat about images using AlignGPT without the Gradio interface.

python -m src.serve.cli \
    --model-path playground/model/aligngpt-13b \
    --image-file "image folder/image.jpg"

Training

We place all training data in the ./playground/data folder. Please download aligngpt_pretrain_data from the Aliyun Drive link and place it in ./playground/data. The data is stored with an .mp4 extension because the storage provider restricts the sharing of zipped files. After downloading the data, run the following script:

wget https://raw.githubusercontent.com/starreeze/drin/main/dataset/data_tools.py
python data_tools.py --dir path/to/datadir \
    --raw_files aligngpt_pretrain_data.tar.xz \
    --encoded_files pretrain.mp4

This converts the data back to the compressed archive and verifies the MD5 checksums. Then extract the archive as usual.
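For example, assuming the reconstructed archive ends up in path/to/datadir and should be extracted under ./playground/data (both paths are placeholders):

tar -xJf path/to/datadir/aligngpt_pretrain_data.tar.xz -C ./playground/data   # -J handles the .xz compression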

Pre-training

The pre-training data should be organized under ./playground/data as follows:

├── LLaVA-Pretrain
│   └── blip_laion_cc_sbu_558k_with_similarity_number.json
│   └── images

Instruction-tuning

The instruction-tuning data should be organized under ./playground/data as follows:

├── llava_v1_5_mix665k.json
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
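A quick sanity check on the layout after extraction (a sketch; adjust the data root if yours differs):

# report any expected instruction-tuning folders that are missing under ./playground/data
for d in coco/train2017 gqa/images ocr_vqa/images textvqa/train_images vg/VG_100K vg/VG_100K_2; do
    [ -d "./playground/data/$d" ] || echo "missing: ./playground/data/$d"
done
[ -f "./playground/data/llava_v1_5_mix665k.json" ] || echo "missing: llava_v1_5_mix665k.json"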

Evaluation

We place all evaluation data in the ./playground/data/eval folder. Please download aligngpt_eval_data from the Aliyun Drive link and place it in ./playground/data/eval. The data is stored with an .mp4 extension because the storage provider restricts the sharing of zipped files. After downloading the data, run the following script:

wget https://raw.githubusercontent.com/starreeze/drin/main/dataset/data_tools.py
python data_tools.py --dir path/to/datadir \
    --raw_files aligngpt_eval_data.tar.xz \
    --encoded_files eval.mp4

This converts the data back to the compressed archive and verifies the MD5 checksums. Then extract the archive as usual.
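As in the training setup, a sketch assuming the reconstructed archive should be extracted under ./playground/data/eval (paths are placeholders):

tar -xJf path/to/datadir/aligngpt_eval_data.tar.xz -C ./playground/data/eval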

We conduct evaluation on 12 benchmarks. Here, we demonstrate how to evaluate the performance of our model on the MME benchmark. Run the evaluation stage with the following command:

CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mme.sh

Set the directories of the model checkpoints and datasets in the script before running it. The evaluation procedures for the other datasets can be found in Evaluation.md.

Performance

| Model | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|-------|-------|-----|--------|-----|-------|------|-----|----------|-------------|------|------------------|--------|
| AlignGPT-7B | 79.1 | 62.9 | 54.2 | 68.5 | 58.4 | 86.0 | 1527.4 | 67.3 | 59.9 | 66.5 | 68.4 | 30.8 |
| AlignGPT-13B | 80.0 | 63.6 | 56.4 | 70.3 | 60.2 | 86.2 | 1572.0 | 69.5 | 63.7 | 67.8 | 75.2 | 35.6 |

Citation

If you find AlignGPT useful for your research and applications, please cite using this BibTeX:

@misc{zhao2024aligngpt,
      title={AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability}, 
      author={Fei Zhao and Taotian Pang and Chunhui Li and Zhen Wu and Junjie Guo and Shangyu Xing and Xinyu Dai},
      year={2024},
      eprint={2405.14129},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgement

Our project is built on LLaVA: Large Language and Vision Assistant.

License

Code License Data License

The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.