Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li
This is the official implementation of the paper What Matters in Transformers? Not All Attention is Needed. We conduct extensive experiments and analysis to reveal the architectural redundancy within transformer-based Large Language Models (LLMs). The pipeline for Block Drop and Layer Drop is built on LLaMA-Factory; quantization is implemented on top of AutoAWQ and AutoGPTQ.
Transformer-based large language models (LLMs) often contain architectural redundancies. In this work, we systematically investigate redundancy across different types of modules, including Blocks, Attention layers, and MLP layers. Surprisingly, we found that Attention layers, the core component of transformers, are particularly redundant. For example, in the Llama-3-70B model, half of the Attention layers can be dropped while maintaining performance. Our observations indicate that this redundancy in Attention layers persists throughout the training process, necessitating Attention Drop. Additionally, dropping Attention layers significantly enhances computational and memory efficiency. Our findings are informative for the ML community and provide insights for future architecture design.
conda create -n llm-drop python=3.10
conda activate llm-drop
git clone https://github.com/CASE-Lab-UMD/LLM-Drop
# For Dropping:
cd ./LLM-Drop
pip install -e .
# For Quantization (run each of the following installs from the repo root):
cd ./src/llmtuner/compression/quantization/AutoAWQ
pip install -e .
cd ./src/llmtuner/compression/quantization/AutoAWQ/AutoAWQ_kernels
pip install -e .
cd ./src/llmtuner/compression/quantization/AutoGPTQ
pip install -vvv --no-build-isolation -e .
Download the models (e.g., Mistral-7B, Llama-2, and Llama-3) from Hugging Face. We provide new config and modeling files that represent the models by layers or blocks.
The key auto_map needs to be added to config.json so that the new modeling files are used.
Take Mistral-7B as an example:
"auto_map": {
"AutoConfig": "configuration_dropped_mistral.MistralConfig",
"AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
},
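The auto_map entry can also be patched into an existing config programmatically. A minimal sketch (the config dict here is a toy stand-in for a real config.json; when later loading the model with transformers, pass trust_remote_code=True so the mapped classes are resolved):

```python
import json

def add_auto_map(config: dict) -> dict:
    """Insert the auto_map entry so Auto classes resolve to the custom files."""
    config["auto_map"] = {
        "AutoConfig": "configuration_dropped_mistral.MistralConfig",
        "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM",
    }
    return config

# Example: patch a minimal Mistral-style config in memory, then serialize it.
config = {"model_type": "mistral", "num_hidden_layers": 32}
patched = add_auto_map(config)
print(json.dumps(patched, indent=2))
```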
Additionally, the keys drop_attn_list and drop_mlp_list mark which Attention layers and MLP layers should be dropped, based on their layer indices. For instance:
"drop_mlp_list": [],
"drop_attn_list": [25, 26, 24, 22],
"drop_mlp_list": [26, 27, 25, 24],
"drop_attn_list": [],
"drop_mlp_list": [26, 25, 24, 27],
"drop_attn_list": [26, 25, 24, 27],
bash scripts/dropping/block_drop.sh
bash scripts/dropping/layer_drop.sh
bash scripts/dropping/layer_drop_joint.sh
These bash scripts will generate the importance scores for blocks/layers, determine which blocks/layers to retain, and create new model configuration files indicating the dropped modules.
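The importance metric in the paper is similarity-based: a module whose output closely resembles its input changes the hidden state little and is a good candidate for dropping. A pure-Python sketch of this idea on toy hidden-state vectors (the repo's actual scoring code may differ in details):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def importance(inp, out):
    # Low importance = output nearly identical to input = safe to drop.
    return 1.0 - cosine(inp, out)

# Toy (input, output) hidden states for three layers; layer 1 barely
# transforms its input, so it should rank as least important.
states = {
    0: ([1.0, 0.0], [0.0, 1.0]),
    1: ([1.0, 1.0], [1.0, 1.0]),
    2: ([1.0, 0.0], [1.0, 1.0]),
}
ranked = sorted(states, key=lambda i: importance(*states[i]))
print(ranked)  # least important first: [1, 2, 0]
```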
Evaluate the performance of a model with some modules dropped on specific tasks:
bash scripts/benchmark/benchmark_lm_eval.sh
The evaluation code is based on EleutherAI/lm-evaluation-harness. To fully reproduce our results, please use this version: it selects few-shot examples by sample index, so results do not vary with the number of processes during data-parallel inference.
Remember to use the modeling files in src/llmtuner/model
to load the Mistral and Llama models.
Evaluate the speedup of a model with some modules dropped:
bash scripts/benchmark/benchmark_speed.sh
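A speedup ratio of this kind is just the average latency of the full model divided by that of the pruned model. A self-contained timing sketch (the two lambdas are cheap stand-ins for real model forward passes, not the benchmark script's actual workload):

```python
import time

def measure(fn, warmup=3, iters=10):
    """Average wall-clock latency of fn after a few warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-ins for a full model and one with half its Attention layers dropped.
full = lambda: sum(i * i for i in range(50_000))
dropped = lambda: sum(i * i for i in range(25_000))

speedup = measure(full) / measure(dropped)
print(f"speedup: {speedup:.2f}x")
```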
Please refer to AutoGPTQ and AutoAWQ, and make sure you install the package builds that match your CUDA version. For quantization, use the following scripts:
bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh
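For reference, AutoAWQ's quantize step is driven by a small quant_config dictionary. The values below are common AWQ defaults shown for illustration; check scripts/quantization/awq.sh for the exact settings used:

```python
# Typical AutoAWQ quantization settings (4-bit weights, group size 128).
quant_config = {
    "zero_point": True,    # asymmetric (zero-point) quantization
    "q_group_size": 128,   # weights quantized in groups of 128
    "w_bit": 4,            # 4-bit weight precision
    "version": "GEMM",     # kernel variant used for inference
}
print(quant_config)
```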
@misc{he2024matterstransformersattentionneeded,
title={What Matters in Transformers? Not All Attention is Needed},
author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
year={2024},
eprint={2406.15786},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2406.15786},
}
If you have any questions, please contact:
Shwai He: shwaihe@umd.edu
Guoheng Sun: ghsun@umd.edu