Official codebase for Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [ArXiv] [ICLR 2024 Workshop on ME-FoMo] [Blog Post].
```bash
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
```
Models from aggressive pruning and CPT (continued pretraining) retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗Hugging Face Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
Models from moderate pruning and LoRA retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗Hugging Face Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
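Any of the checkpoints above can be loaded directly with Hugging Face Transformers. A minimal usage sketch (the model ID is one entry from the tables; swap in any other):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any ID from the tables above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

inputs = tokenizer("Depth pruning removes entire Transformer blocks, so", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```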
The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
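As a rough illustration of step (1), the sketch below removes whole decoder blocks from a `LlamaForCausalLM`-style model. The base model ID and the layer indices are illustrative only; the scripts select blocks with the PPL or Taylor criterion before retraining.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Base model ID is illustrative; any LlamaForCausalLM checkpoint works the same way.
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3")

# Hypothetical block indices judged unimportant; the scripts derive these from
# the PPL or Taylor importance scores instead of hard-coding them.
layers_to_remove = {21, 22, 23, 24, 25, 26}

# Keep the remaining decoder blocks and update the config to match.
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in layers_to_remove
)
model.config.num_hidden_layers = len(model.model.layers)
# (Recent Transformers versions also track a per-block layer_idx for KV caching,
#  which may need re-indexing after removal.)

# The shortened model is then retrained (LoRA or continued pretraining) to recover quality.
```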
LLaMA-1-7B (`LlamaForCausalLM`):
```bash
bash script/prune_llama-7b_crit-ppl.sh
bash script/prune_llama-7b_crit-taylor.sh
```
LLaMA-2-7B (`LlamaForCausalLM`):
```bash
bash script/prune_llama2-7b_crit-ppl.sh
bash script/prune_llama2-7b_crit-taylor.sh
```
LLaMA-3-8B (`LlamaForCausalLM`):
```bash
bash script/prune_llama3-8b_crit-ppl.sh
bash script/prune_llama3-8b_crit-taylor.sh
```
Vicuna-v1.3-7B (`LlamaForCausalLM`):
```bash
bash script/prune_vicuna-7b_crit-ppl.sh
bash script/prune_vicuna-7b_crit-taylor.sh
```
Vicuna-v1.3-13B (`LlamaForCausalLM`):
```bash
bash script/prune_vicuna-13b_crit-ppl.sh
bash script/prune_vicuna-13b_crit-taylor.sh
```
CatPPT (`MistralForCausalLM`):
```bash
bash script/prune_CatPPT_crit-ppl.sh
bash script/prune_CatPPT_crit-taylor.sh
```
Gemma-2B (`GemmaForCausalLM`):
```bash
bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
```
Gemma-7B (`GemmaForCausalLM`):
```bash
bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
```
To test other pruning ratios, use:
```bash
bash script/prune.sh
```
To obtain baselines using the magnitude pruning criterion, use:
```bash
bash script/prune_llama-7b_crit-magnitude.sh
bash script/prune_vicuna-7b_crit-magnitude.sh
bash script/prune_vicuna-13b_crit-magnitude.sh
```
To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks (using EleutherAI/lm-evaluation-harness, version 3326c54), use:
```bash
bash script/evaluate.sh
```
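For a quick sanity check outside the harness, WikiText-2 perplexity can be approximated with a simple non-overlapping-window loop. This is only a sketch and may not exactly match the harness numbers:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

# Concatenate the WikiText-2 test split and score it in non-overlapping 2048-token windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, i : i + seq_len]
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)
print("WikiText-2 PPL:", torch.exp(torch.stack(nlls).mean()).item())
```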
(Optional) Any post-training quantization method can be applied to our pruned models. The example script below quantizes a pruned model with GPTQ and measures its performance using `script/evaluate.sh`:
```bash
bash script/quantize_gptq_vicuna-7b.sh
```
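For reference, a minimal sketch of GPTQ quantization via the Transformers integration (this assumes `optimum` and `auto-gptq` are installed and is not necessarily identical to what the script above does):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with C4 calibration data; quantization runs while the model is loaded.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
quantized.save_pretrained("st-vicuna-5.5b-gptq-4bit")  # hypothetical output path
```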
To measure latency & throughput, use:
```bash
bash script/measure_time.sh
```
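For a rough standalone check (not the measurement script itself), greedy-decoding latency and throughput can be timed as follows, assuming a single CUDA GPU:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Depth pruning reduces", return_tensors="pt").to("cuda")
new_tokens = 128
model.generate(**inputs, max_new_tokens=new_tokens)  # warm-up run

torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"latency: {elapsed:.2f} s, throughput: {new_tokens / elapsed:.1f} tokens/s")
```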
To measure VRAM requirements, use:
```bash
bash script/measure_vram.sh
```
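Similarly, a rough standalone sketch of peak VRAM during generation using PyTorch's memory statistics (again, not necessarily what the script measures):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("Depth pruning reduces", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=128)
# Peak includes the resident weights plus activations and KV cache.
print(f"peak VRAM during generation: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```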
To measure GPU compute utilization, use:
```bash
bash script/measure_gpuutil.sh
```
The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:
```bash
pip install transformers==4.33.1  # to run LLM-Pruner's model
python src/app.py
```
```bibtex
@article{kim2024shortened,
  title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}
```