$\gamma$-MOD is a novel approach to enhance computational efficiency in Multimodal Large Language Models (MLLMs) by incorporating Mixture-of-Depth (MoD) layers. This plug-and-play strategy seamlessly replaces redundant dense layers, significantly reducing computational costs while maintaining performance.
Despite recent advancements in MLLMs, their high computational demands have limited practical applications, especially real-time inference. Traditional Mixture-of-Experts (MoE) techniques attempt to address this by sparsifying the activated parameters, but they still process every token and therefore fall short of optimal efficiency. $\gamma$-MOD tackles the problem from the other direction: it reduces the number of activated tokens by transforming redundant dense MLLM layers into sparse MoD layers, ultimately making MLLMs more accessible and applicable in resource-constrained environments.
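To make the layer conversion concrete, here is a minimal, hypothetical PyTorch sketch of a mixture-of-depth layer. It is not the implementation used in this repository: the class name `MoDLayer`, the sigmoid gating, and the default capacity of 0.34 are illustrative assumptions. It shows the idea described above: a lightweight router scores every token, only the top-scoring fraction is processed by the wrapped dense block, and the remaining tokens skip the layer entirely.

```python
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    """Toy mixture-of-depth wrapper: route a fraction of tokens through the
    wrapped block and let the rest skip it entirely (illustrative sketch)."""

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.34):
        super().__init__()
        self.block = block                       # the original dense layer
        self.router = nn.Linear(hidden_size, 1)  # lightweight token-scoring head
        self.capacity = capacity                 # fraction of tokens to keep

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, dim = hidden_states.shape
        k = max(1, int(seq_len * self.capacity))

        # Score every token and pick the top-k per sequence.
        scores = self.router(hidden_states).squeeze(-1)       # (bsz, seq_len)
        topk_scores, topk_idx = scores.topk(k, dim=-1)         # (bsz, k)
        gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)

        # Only the selected tokens are processed by the dense block.
        selected = torch.gather(hidden_states, 1, gather_idx)  # (bsz, k, dim)
        processed = self.block(selected)

        # Blend the block's contribution with a routing weight (keeps the router
        # trainable), then scatter the updated tokens back into place.
        # Unselected tokens pass through unchanged, i.e. they skip the layer.
        gate = topk_scores.sigmoid().unsqueeze(-1)
        updated = selected + gate * (processed - selected)
        output = hidden_states.clone()
        output.scatter_(1, gather_idx, updated)
        return output


if __name__ == "__main__":
    # Quick shape check with a generic PyTorch block standing in for an LLM layer.
    block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    layer = MoDLayer(block, hidden_size=64, capacity=0.34)
    x = torch.randn(2, 20, 64)
    print(layer(x).shape)  # torch.Size([2, 20, 64])
```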
$\gamma$-MOD delivers significant efficiency improvements. Our routing is effective at directing tokens and focusing computation on critical information, as visualized in Fig. 4:

- Consistent routing patterns (Fig. 4a)
- Efficient content skipping (Fig. 4b)
- Improved focus on critical information
This visualization demonstrates how $\gamma$-MOD effectively reduces computational overhead while maintaining the model's ability to process and respond to complex multimodal inputs.
(Notice: install the packages and versions required by the model you wish to convert to its MoD version. The steps below are for LLaVA-HR; for Mini-Gemini, additionally upgrade transformers to 4.36.2, as in the official repository.)
git clone https://github.com/Yaxin9Luo/Gamma-MOD.git
cd Gamma-MOD
conda create -n gamma-mod python=3.10 -y
conda activate gamma-mod
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
Please refer to the original LLaVA-HR and Mini-Gemini repositories (or the official repository of whichever MLLM you are using) for data preparation.
Important Notice: for the fine-tuning stage, you need to modify the data JSON file so that the image tokens are moved to the beginning of the sequence. You can refer to modify_data_config.py (a conceptual sketch of the transformation appears after the command below), or run:
python modify_data_config.py /path/to/your/llava_v1_5_mix665k.json /path/to/save/your/modified_llava_v1_5_mix665k.json
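For reference, a hedged sketch of what such a modification can look like is below. It assumes the standard LLaVA-style schema of llava_v1_5_mix665k.json (a list of samples with an optional "image" field and a "conversations" list whose first human turn contains the `<image>` placeholder); the function name is hypothetical, and the actual modify_data_config.py remains the authoritative script.

```python
import json
import sys


def move_image_token_to_front(in_path: str, out_path: str) -> None:
    """Illustrative sketch: ensure the <image> token leads the first human turn."""
    with open(in_path, "r") as f:
        samples = json.load(f)

    for sample in samples:
        if "image" not in sample:          # text-only samples have no image token
            continue
        first_turn = sample["conversations"][0]
        text = first_turn["value"]
        if "<image>" in text:
            # Remove the token from its current position and prepend it.
            text = text.replace("<image>", "").strip()
            first_turn["value"] = "<image>\n" + text

    with open(out_path, "w") as f:
        json.dump(samples, f, indent=2)


if __name__ == "__main__":
    move_image_token_to_front(sys.argv[1], sys.argv[2])
```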
Please download the caption annotations blip_laion_cc_sbu_558k.json and images from here. Move the downloaded files to the /data/data folder. Then run the following command to start the training process:
bash scripts/v1_5/pretrain_llava_hr.sh
We recommend using the pre-trained projectors directly; here are the links from the official LLaVA-HR and Mini-Gemini releases.

| Version | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| LLaVA-HR-7b | CLIP-L & ConvNeXt-L | MLP-2x | LCS-558K | 1e | projector |
| LLaVA-HR-X-13b | CLIP-L & ConvNeXt-XXL | MLP-2x | LCS-558K | 1e | projector |
| Mini-Gemini-HD-7b | CLIP-L | MLP-2x | MGM-Pretrain | 1e | projector |
Please run the stage-1 alignment model on any dataset you wish in order to compute the ARank; we use sqa as an example.
bash scripts/v1_5/eval_full/arank.sh /path/to/your/stage1_checkpoint
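For intuition, ARank is the rank of a layer's attention maps: layers whose attention responses are numerically low-rank are considered redundant and are good candidates for MoD conversion. The snippet below is a hedged, self-contained illustration of such a rank measurement on a generic attention tensor; the function name, tolerance, and toy example are assumptions, and the shipped arank.sh script remains the authoritative implementation.

```python
import torch


@torch.no_grad()
def attention_rank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Illustrative ARank-style estimate for one layer.

    attn: attention weights of shape (batch, heads, seq, seq).
    Returns the numerical rank averaged over batch and heads, treating singular
    values below `tol * largest_singular_value` as zero.
    """
    s = torch.linalg.svdvals(attn.float())        # (batch, heads, seq)
    threshold = tol * s[..., :1]                  # per-map relative cut-off
    ranks = (s > threshold).sum(dim=-1).float()   # (batch, heads)
    return ranks.mean().item()


if __name__ == "__main__":
    # A uniform attention map is exactly rank 1, so its ARank estimate is ~1.
    uniform_attn = torch.full((2, 8, 16, 16), 1.0 / 16)
    print(attention_rank(uniform_attn))
```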
We also provide the stage-1 checkpoints for your convenience.

| Version | Download |
|---|---|
| $\gamma$-MOD-llava-hr-7b-stage1 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-stage1 | model |
After you obtain the ARank values, use them to decide which dense layers to replace with MoD layers in the original model; see llava_llama_mod.py and its initialize_mod_modules function (a conceptual sketch follows the command below). Then train the model with the following command:
bash /path/to/your/fine_tune_mod.sh
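As a hedged illustration of the conversion step (the real logic is initialize_mod_modules in llava_llama_mod.py; the function name, threshold value, and factory argument below are assumptions): layers whose measured ARank falls below a redundancy threshold are wrapped with a MoD router, while the rest stay dense.

```python
from typing import Callable, List

import torch.nn as nn


def convert_redundant_layers(
    layers: nn.ModuleList,
    aranks: List[float],
    make_mod_layer: Callable[[nn.Module], nn.Module],
    arank_threshold: float = 40.0,   # illustrative cut-off, tune per model
) -> nn.ModuleList:
    """Wrap layers whose ARank is below the threshold; keep the rest dense."""
    converted = []
    for idx, (layer, arank) in enumerate(zip(layers, aranks)):
        if arank < arank_threshold:
            print(f"layer {idx}: ARank={arank:.1f} -> converting to MoD")
            converted.append(make_mod_layer(layer))   # redundant layer becomes sparse
        else:
            converted.append(layer)                   # informative layer stays dense
    return nn.ModuleList(converted)


# Example usage (with the MoDLayer sketch from the overview section; the module
# path and hidden size are illustrative for a 7B LLaMA-style backbone):
#   sparse_layers = convert_redundant_layers(
#       model.model.layers, measured_aranks,
#       make_mod_layer=lambda blk: MoDLayer(blk, hidden_size=4096, capacity=0.34),
#   )
```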
We also provide the stage-2 SFT checkpoints for your convenience.

| Version | Download |
|---|---|
| $\gamma$-MOD-llava-hr-7b-0.34 | model |
| $\gamma$-MOD-llava-hr-13b-0.34 | model |
| $\gamma$-MOD-llava-hr-13b-0.5 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-0.34 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-0.5 | model |
We follow LLaVA-v1.5 to conduct evaluations. Download eval.zip and unzip it to ./playground/data/eval. Please refer to Evaluation.md to prepare the data. Then you can run our evaluation script: bash scripts/v1_5/eval.sh
$\gamma$-MOD was tested on three popular MLLMs across 9 benchmark datasets.
| Model | Training Time Reduction | Inference Time Reduction | Accuracy Change |
|---|---|---|---|
| $\gamma$-MoD-LLaVA-HR-7B | 31.0% | 53.2% | -1.5% |
| $\gamma$-MoD-LLaVA-HR-13B | 18.8% | 50.4% | -0.3% |
| $\gamma$-MoD-LLaVA-HR-X-13B | 17.4% | 58.6% | +0.4% |
| $\gamma$-MoD-Mini-Gemini-HD-7B | 41.0% | 58.1% | -1.0% |
For more details, check the full report.
If you use $\gamma$-MOD in your work, please cite:
@misc{luo2024gammamodexploringmixtureofdepthadaptation,
title={$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models},
author={Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji},
year={2024},
eprint={2410.13859},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.13859},
}
For questions, please reach out to Yaxin Luo.
This project is licensed under the MIT License - see the LICENSE file for details.
Special thanks to all contributors, and to the LLaVA, LLaVA-HR, and MGM projects for their codebases.
We are also grateful to LLaVA-pp and MoE-LLaVA for releasing their models and code as open-source contributions.