$\gamma$-MOD: Mixture-of-Depth Adaptation for Multimodal Large Language Models

Gamma-MOD Banner

[Project Page](https://yaxin9luo.github.io/gamma-mod-webpage/) | [arXiv](https://arxiv.org/abs/2410.13859) | Open In Spaces | License: MIT | Contact

📣 News

🔗 Table of Contents

🚀 Overview

$\gamma$-MOD is a novel approach to enhance computational efficiency in Multimodal Large Language Models (MLLMs) by incorporating Mixture-of-Depth (MoD) layers. This plug-and-play strategy seamlessly replaces redundant dense layers, significantly reducing computational costs while maintaining performance.
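For intuition, the snippet below is a minimal PyTorch sketch of the mixture-of-depth idea: a lightweight router scores every token and only the top fraction is processed by the wrapped block, while the remaining tokens bypass it through the residual path. It is an illustrative simplification under assumed names and shapes, not the implementation in this repo (see llava_llama_mod.py for that).

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Illustrative mixture-of-depth wrapper: only the highest-scored tokens
    are processed by the wrapped block; the remaining tokens skip it."""

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.34):
        super().__init__()
        self.block = block                         # the original dense transformer block
        self.router = nn.Linear(hidden_size, 1)    # per-token importance score
        self.capacity = capacity                   # fraction of tokens that get computed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, hidden_size]
        scores = self.router(x).squeeze(-1)           # [batch, seq_len]
        k = max(1, int(self.capacity * x.size(1)))    # tokens kept per sample
        keep = scores.topk(k, dim=1).indices          # indices of retained tokens

        out = x.clone()                               # skipped tokens pass through unchanged
        for b in range(x.size(0)):                    # simple per-sample loop for clarity
            idx = keep[b]
            processed = self.block(x[b, idx].unsqueeze(0)).squeeze(0)
            gate = torch.sigmoid(scores[b, idx]).unsqueeze(-1)
            # blend with the router weight so routing stays differentiable
            out[b, idx] = x[b, idx] + gate * (processed - x[b, idx])
        return out


# usage with a stand-in block (any module mapping [*, seq, hidden] -> the same shape)
block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64))
layer = MoDLayer(block, hidden_size=64, capacity=0.34)
print(layer(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```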

💡 Motivation

Despite recent advancements in MLLMs, their high computational demands have limited practical applications, especially for real-time inference. Traditional Mixture-of-Experts (MoE) techniques have attempted to address this issue, but often fall short in achieving optimal efficiency. $\gamma$-MOD tackles this challenge by introducing a new paradigm that focuses on reducing activated tokens, offering superior efficiency compared to existing methods. Our approach is inspired by the concept of activated tokens and aims to transform dense MLLM layers into sparse MoD layers, ultimately making MLLMs more accessible and applicable in resource-constrained environments.

⭐ Key Features:

📊 Efficiency Gains

Our $\gamma$-MOD approach routes tokens efficiently and concentrates computation on critical information, yielding significant efficiency gains. Fig. 4 illustrates these results visually.

Key Observations:

Visualization of Routing and Skipped Content

  1. Consistent Routing Patterns (Fig. 4a):

    • Question tokens are mostly retained
    • Image tokens show the highest redundancy and are routed the most
    • Response tokens fall between these two extremes
  2. Efficient Content Skipping (Fig. 4b):

    • Gray areas in images represent skipped tokens (often background or less relevant pixels)
    • White areas highlight regions the model focuses on more intensely
  3. Improved Focus on Critical Information:

    • By routing out redundant tokens, the model can allocate more computational resources to important areas
    • Example: In the IQ test image (middle of first row), the model concentrates on arithmetic and geometric aspects, leading to more accurate responses

This visualization demonstrates how $\gamma$-MOD effectively reduces computational overhead while maintaining the model's ability to process and respond to complex multimodal inputs.
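The routing behaviour above can be summarized with a simple statistic: the fraction of tokens of each type that the router skips. The sketch below shows that bookkeeping with placeholder data; the token-type ids and skip decisions are illustrative inputs, not outputs of this repo.

```python
import torch

def skip_ratio_per_type(skip_mask: torch.Tensor, type_ids: torch.Tensor) -> dict:
    """Fraction of skipped tokens per token type.

    skip_mask: [seq_len] bool, True where the router skipped the token.
    type_ids:  [seq_len] int, 0 = question, 1 = image, 2 = response.
    """
    names = {0: "question", 1: "image", 2: "response"}
    ratios = {}
    for tid, name in names.items():
        sel = type_ids == tid
        if sel.any():
            ratios[name] = skip_mask[sel].float().mean().item()
    return ratios

# toy example mirroring Fig. 4a: image tokens are skipped far more often than text tokens
type_ids = torch.tensor([0] * 10 + [1] * 40 + [2] * 10)
skip_mask = torch.cat([torch.zeros(10, dtype=torch.bool),
                       torch.rand(40) < 0.8,
                       torch.rand(10) < 0.2])
print(skip_ratio_per_type(skip_mask, type_ids))
```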


🛠️ Getting Started

Installation

(Note: install the packages and versions required by the model you want to convert to its MoD version. The steps below are for LLaVA-HR; for Mini-Gemini, simply upgrade transformers to 4.36.2 to match its official version.)

  1. Clone the repository and navigate to the $\gamma$-MOD folder:
git clone https://github.com/Yaxin9Luo/Gamma-MOD.git
cd Gamma-MOD
  2. Create and activate a new conda environment:
conda create -n gamma-mod python=3.10 -y
conda activate gamma-mod
  3. Upgrade pip and install the package:
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  4. Install additional packages for training:
pip install ninja
pip install flash-attn --no-build-isolation

Data Preparation

Please refer to the original LLaVA-HR and Mini-Gemini repositories for data preparation, or to the official repo of whichever MLLM you are using.

Important Notice: For the fine-tune stage, you need to modify the data JSON file so that the image tokens are moved to the beginning of the sequence. You can refer to modify_data_config.py, or run the command below:

python modify_data_config.py /path/to/your/llava_v1_5_mix665k.json /path/to/save/your/modified_llava_v1_5_mix665k.json
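For reference, the sketch below captures the rough idea of this step, assuming the LLaVA-style conversation format in which the first human turn contains an `<image>` placeholder; modify_data_config.py in this repo is the authoritative version.

```python
import json
import sys

def move_image_token_to_front(sample: dict) -> dict:
    """Move the <image> placeholder to the start of the first human turn."""
    if "image" not in sample:
        return sample                      # text-only sample, nothing to do
    first_turn = sample["conversations"][0]
    text = first_turn["value"]
    if "<image>" in text:
        first_turn["value"] = "<image>\n" + text.replace("<image>", "").strip()
    return sample

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]    # input and output JSON paths
    with open(src) as f:
        data = json.load(f)
    data = [move_image_token_to_front(s) for s in data]
    with open(dst, "w") as f:
        json.dump(data, f, indent=2)
```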

Training

Stage 1: Pretraining

Please download the caption annotations blip_laion_cc_sbu_558k.json and images from here. Move the downloaded files to the /data/data folder. Then run the following command to start the training process:

bash scripts/v1_5/pretrain_llava_hr.sh

We recommend starting directly from the pre-trained projectors; here are the links from the official LLaVA-HR and Mini-Gemini releases.

| Version | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---------|----------------|------------|---------------|----------------------|----------|
| LLaVA-HR-7b | CLIP-L & ConvNeXt-L | MLP-2x | LCS-558K | 1e | projector |
| LLaVA-HR-X-13b | CLIP-L & ConvNeXt-XXL | MLP-2x | LCS-558K | 1e | projector |
| Mini-Gemini-HD-7b | CLIP-L | MLP-2x | MGM-Pretrain | 1e | projector |

Stage 2: $\gamma$-MOD Fine-Tuning

Step 1: ARank analysis

Please run the stage-1 alignment model on any dataset you like to compute the ARank; we use SQA as an example (a conceptual sketch of the measurement follows the checkpoint table below).

bash scripts/v1_5/eval_full/arank.sh /path/to/your/stage1_checkpoint 
We also provide the stage-1 checkpoints for your convenience.

| Version | Download |
|---------|----------|
| $\gamma$-MOD-llava-hr-7b-stage1 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-stage1 | model |
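Conceptually, ARank measures how informative a layer's attention maps are: layers whose attention maps have low rank carry redundant computation and are the candidates for MoD conversion. The snippet below is a rough, self-contained sketch of such a measurement on toy tensors; see the paper and the script above for the exact definition and for how the attention maps are extracted from the stage-1 model.

```python
import torch

def arank(attn: torch.Tensor, eps: float = 1e-3) -> float:
    """Approximate rank of a layer's attention maps.

    attn: [num_heads, seq_len, seq_len] attention weights for one sample.
    Returns the mean count of singular values above eps * largest value.
    """
    ranks = []
    for head in attn:                                   # [seq_len, seq_len]
        s = torch.linalg.svdvals(head.float())          # singular values, descending
        ranks.append((s > eps * s[0]).sum().item())
    return sum(ranks) / len(ranks)

# toy usage: a lower ARank suggests a more redundant layer, i.e. a better MoD candidate
low_rank_attn = torch.ones(8, 32, 32) / 32              # near rank-1 attention
high_rank_attn = torch.softmax(torch.randn(8, 32, 32), dim=-1)
print(arank(low_rank_attn), arank(high_rank_attn))
```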
Step 2: Fine-Tuning

After you obtain the ARank, use it to decide which dense layers in the original model to replace; refer to the llava_llama_mod.py file and the initialize_mod_modules function (a rough sketch of this replacement follows the checkpoint table below). Then train the model with the following command:

bash /path/to/your/fine_tune_mod.sh
We also provide the stage-2 SFT checkpoints for your convenience.

| Version | Download |
|---------|----------|
| $\gamma$-MOD-llava-hr-7b-0.34 | model |
| $\gamma$-MOD-llava-hr-13b-0.34 | model |
| $\gamma$-MOD-llava-hr-13b-0.5 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-0.34 | model |
| $\gamma$-MOD-Mini-Gemini-HD-7b-0.5 | model |
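As a rough illustration of what the replacement step amounts to (initialize_mod_modules in llava_llama_mod.py is the actual implementation), the hypothetical helper below picks the layers with the lowest ARank and wraps them with a MoD layer such as the MoDLayer sketch from the Overview; the function name, convert_ratio, and capacity arguments are assumptions for illustration only.

```python
# Hypothetical sketch (reuses the MoDLayer class from the Overview section above).
def convert_redundant_layers(layers, arank_scores, convert_ratio=0.5,
                             capacity=0.34, hidden_size=4096):
    """Wrap the layers with the lowest ARank (most redundant) in a MoD layer."""
    num_convert = int(len(layers) * convert_ratio)
    # sort layer indices by ARank, ascending: lowest rank = most redundant
    order = sorted(range(len(layers)), key=lambda i: arank_scores[i])
    for i in order[:num_convert]:
        layers[i] = MoDLayer(layers[i], hidden_size=hidden_size, capacity=capacity)
    return layers
```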

⚖️ Evaluation

We follow LLaVA-v1.5 to conduct evaluations. You should download eval.zip and unzip it to ./playground/data/eval. Please refer to Evaluation.md to prepare the data.

Then, you can run our evaluation script: bash scripts/v1_5/eval.sh.

🔬 Experiments

$\gamma$-MOD was tested on three popular MLLMs across 9 benchmark datasets.

Experimental Results


📈 Results

| Model | Training Time Reduction | Inference Time Reduction | Accuracy Change |
|-------|-------------------------|--------------------------|-----------------|
| $\gamma$-MoD-LLaVA-HR-7B | 31.0% | 53.2% | -1.5% |
| $\gamma$-MoD-LLaVA-HR-13B | 18.8% | 50.4% | -0.3% |
| $\gamma$-MoD-LLaVA-HR-X-13B | 17.4% | 58.6% | +0.4% |
| $\gamma$-MoD-Mini-Gemini-HD-7B | 41.0% | 58.1% | -1.0% |

For more details, check the full report.


📖 Citation

If you use $\gamma$-MOD in your work, please cite:

@misc{luo2024gammamodexploringmixtureofdepthadaptation,
      title={$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models}, 
      author={Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji},
      year={2024},
      eprint={2410.13859},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13859}, 
}

📧 Contact

For questions, please reach out to Yaxin Luo.


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


👀 Acknowledgments

Special thanks to all contributors and to the LLaVA, LLaVA-HR, and MGM projects for their codebases.

We are also thankful to LLaVA-pp and MoE-LLaVA for releasing their models and code as open-source contributions.

Star History

Star History Chart