This repository contains the code for our TMLR 2024 paper Mantis (https://arxiv.org/abs/2405.01483).
Recent years have seen a wide array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks still needs improvement.
Existing multi-image LMMs (e.g. OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective.
Therefore, we present Mantis, a LLaMA-3-based LMM that takes interleaved text and images as input, trained on Mantis-Instruct under academic-level resources (i.e. 36 hours on 16xA100-40G).
Mantis achieves state-of-the-art performance on 5 multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval) while maintaining strong single-image performance on par with CogVLM and Emu2.
conda create -n mantis python=3.10
conda activate mantis
pip install -e .
# install flash-attention
pip install flash-attn --no-build-isolation
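If the flash-attention install fails on your machine, it can help to detect that at runtime rather than crash on import. The snippet below is a minimal sketch, not part of this repository: `pick_attn_implementation` is a hypothetical helper name, and it assumes you load models through Hugging Face Transformers, where `attn_implementation="flash_attention_2"` selects flash-attention.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable.

    Hypothetical helper (not part of the Mantis code base). Returning None
    lets Transformers fall back to its default attention implementation.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return None
```

The result can then be passed as `attn_implementation=pick_attn_implementation()` to a `from_pretrained` call.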
You can run inference with the following command:
cd examples
python run_mantis.py
Install the requirements with the following command:
pip install -e .[train,eval]
cd mantis/train
Our training scripts follow the coding format and model structure of Hugging Face. Unlike the LLaVA GitHub repo, you can directly load our models from the Hugging Face model hub.
(The example data are all pre-prepared in the data/examples/
folder, so you can check the data format and debug the training scripts directly. Set CUDA_VISIBLE_DEVICES
to the GPU you want to use.)
training with text-image interleaved data (see example data)
cd mantis/train
bash scripts/train_example_chat.sh # Q-lora, 1 GPU required
training with video-text interleaved data (see example data)
cd mantis/train
bash scripts/train_example_video.sh # Q-lora, 1 GPU required
training with classification data (see example data)
cd mantis/train
bash scripts/train_example_classification.sh # full-finetune, might need 8 GPUs or more
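If you prefer to launch the example scripts from Python, the GPU selection described above can be sketched as follows. This is an illustrative wrapper under stated assumptions, not code from this repository: `gpu_env` and `run_example` are hypothetical names, and the sketch assumes it is run from the repository root so that `mantis/train` resolves.

```python
import os
import subprocess


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_ids is an iterable of GPU indices, e.g. [0] or [0, 1].
    The current process environment is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_example(script, gpu_ids, train_dir="mantis/train"):
    """Run one example script from the training directory on the chosen GPUs.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    return subprocess.run(
        ["bash", script], cwd=train_dir, env=gpu_env(gpu_ids), check=True
    )
```

For instance, `run_example("scripts/train_example_chat.sh", [0])` would run the Q-LoRA chat example on GPU 0 only.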
We support training Mantis on both the Fuyu architecture and the LLaVA architecture. You can train the models with the following commands:
Training Mantis based on LLaMA3 with CLIP/SigLIP encoder:
Pretrain Mantis-LLaMA3 Multimodal projector on pretrain data (Stage 1):
bash scripts/pretrain_mllava.sh
Fine-tune the pretrained Mantis-LLaMA3 on Mantis-Instruct (Stage 2):
bash scripts/train_mllava.sh
Training Mantis based on Fuyu-8B:
bash scripts/train_fuyu.sh
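The two Mantis-LLaMA3 stages must run in order, since Stage 2 fine-tunes the model whose projector was pre-trained in Stage 1. A minimal sketch of chaining them so that Stage 2 never starts after a failed Stage 1 is shown below. `run_stages` is a hypothetical helper, and the sketch assumes the checkpoint paths inside both scripts are already configured to match each other.

```python
import subprocess

# Script names are taken from the commands above; the labels are descriptive only.
STAGES = [
    ("Stage 1: pretrain multimodal projector", "scripts/pretrain_mllava.sh"),
    ("Stage 2: fine-tune on Mantis-Instruct", "scripts/train_mllava.sh"),
]


def run_stages(stages=STAGES, train_dir="mantis/train", runner=subprocess.run):
    """Run each stage in order and return the labels of completed stages.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so a failed stage stops the loop before the next one starts.
    """
    completed = []
    for label, script in stages:
        runner(["bash", script], cwd=train_dir, check=True)
        completed.append(label)
    return completed
```

The `runner` parameter exists only so the ordering can be checked without launching a real training job.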
Note:
See mantis/train/README.md for more details.
Check all the training scripts in mantis/train/scripts
To reproduce our evaluation results, please check mantis/benchmark/README.md
You can download and prepare Mantis-Instruct with the following command (downloading and extracting might take about an hour):
python data/download_mantis_instruct.py --max_workers 8
We provide the following models in the 🤗 Hugging Face model hub:
Run Mantis-8B-Idefics2:
cd examples && python run_mantis_idefics2.py
Mantis-8B-siglip-llama3:
cd examples && python run_mantis.py
Mantis-8B-Fuyu:
cd examples && python run_mantis_fuyu.py
We provide a simple chat CLI for Mantis models. You can run the following command to chat with Mantis-8B-siglip-llama3:
python examples/chat_mantis.py
The following intermediate checkpoints, saved after pre-training the multimodal projectors, are also available for reproducing our experiments. (Please note that these checkpoints still need further fine-tuning on Mantis-Instruct before they are usable; they are not working models.)
@article{Jiang2024MANTISIM,
title={MANTIS: Interleaved Multi-Image Instruction Tuning},
author={Dongfu Jiang and Xuan He and Huaye Zeng and Cong Wei and Max W.F. Ku and Qian Liu and Wenhu Chen},
journal={Transactions on Machine Learning Research},
year={2024},
volume={2024},
url={https://openreview.net/forum?id=skLtdUVaJa}
}