
Shortened LLM by Nota AI

Official codebase for "Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods" [[ArXiv](https://arxiv.org/abs/2402.02834)] [[ICLR 2024 Workshop on ME-FoMo](https://openreview.net/forum?id=18VGxuOdpu)] [Blog Post].


Installation

```bash
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
```
Note on package versions:
- Parts of the following repositories are included for evaluation:
  - `src/LLMPruner`: horseee/LLM-Pruner version [213ffa4](https://github.com/horseee/LLM-Pruner/tree/213ffa4d02f92f16d29219a97fd01a8622db1550)
  - `src/lm_eval`: EleutherAI/lm-evaluation-harness version [3326c54](https://github.com/EleutherAI/lm-evaluation-harness/tree/3326c547a733d598b4377e54be96e194861b964c)
- Torch versions used in our experiments: `2.0.1` for RTX3090 & A100; `2.1.1` for H100.
(optional) GPTQ support:
- Post-training quantization can be further applied to our pruned models; we applied GPTQ to the pruned & retrained models.
- Repo: [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/tree/v0.7.1) version `0.7.1`
- To install the required packages, we recommend installation from source:
  ```bash
  git clone https://github.com/AutoGPTQ/AutoGPTQ.git
  cd AutoGPTQ
  git checkout v0.7.1
  pip install -vvv -e .
  ```
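
For reference, the following is a minimal quantization sketch using the AutoGPTQ v0.7.x API; the pruned-model ID (taken from the tables below), calibration text, output directory, and 4-bit settings are illustrative placeholders, not the exact configuration we used.

```python
# Minimal GPTQ post-training quantization sketch with AutoGPTQ (placeholders throughout).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl"  # a pruned & retrained model from the tables below
out_dir = "st-vicuna-3.7b-gptq-4bit"              # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Calibration samples: a small list of tokenized texts (use a real calibration set in practice).
examples = [tokenizer("Example calibration text for GPTQ.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)        # run GPTQ calibration
model.save_quantized(out_dir)   # save the 4-bit checkpoint
tokenizer.save_pretrained(out_dir)
```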

Models from Aggressive Pruning & CPT Retraining (arXiv-v2):

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | [nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl](https://huggingface.co/nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl) |
| Vicuna-v1.3-7B | 45% | PPL | [nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl](https://huggingface.co/nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl) |
| Vicuna-v1.3-7B | 60% | PPL | [nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl](https://huggingface.co/nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl) |
| Vicuna-v1.3-7B | 80% | PPL | [nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl](https://huggingface.co/nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl) |

Results were obtained with EleutherAI/lm-evaluation-harness version [3326c54](https://github.com/EleutherAI/lm-evaluation-harness/tree/3326c547a733d598b4377e54be96e194861b964c).

Models from Moderate Pruning & LoRA Retraining (arXiv-v1):

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | [nota-ai/st-llama-1-5.5b-ppl](https://huggingface.co/nota-ai/st-llama-1-5.5b-ppl) |
| LLaMA-1-7B | 20% | Taylor+ | [nota-ai/st-llama-1-5.5b-taylor](https://huggingface.co/nota-ai/st-llama-1-5.5b-taylor) |
| Vicuna-v1.3-7B | 20% | PPL | [nota-ai/st-vicuna-v1.3-5.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-ppl) |
| Vicuna-v1.3-7B | 20% | Taylor+ | [nota-ai/st-vicuna-v1.3-5.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-taylor) |
| Vicuna-v1.3-13B | 21% | PPL | [nota-ai/st-vicuna-v1.3-10.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-ppl) |
| Vicuna-v1.3-13B | 21% | Taylor+ | [nota-ai/st-vicuna-v1.3-10.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-taylor) |

Results were obtained with EleutherAI/lm-evaluation-harness version [3326c54](https://github.com/EleutherAI/lm-evaluation-harness/tree/3326c547a733d598b4377e54be96e194861b964c).
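
Since depth pruning removes whole Transformer blocks, the released checkpoints remain standard LLaMA-architecture models and load with the usual 🤗 Transformers API. A minimal usage sketch (the chosen model ID is from the tables above; the prompt and generation settings are illustrative):

```python
# Load a depth-pruned checkpoint and run a short generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any model ID from the tables above works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Efficient LLMs are", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```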

Examples

The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
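
For intuition, here is a conceptual sketch of step (1) only: it drops a hypothetical set of decoder blocks from a LLaMA-family model. The block indices, base model, and output path are placeholders; the repository's scripts instead rank blocks with the PPL or Taylor+ criteria before removal and then proceed to retraining and evaluation.

```python
# Conceptual depth-pruning sketch: drop whole decoder blocks from a LLaMA-family model.
# The indices below are hypothetical placeholders, not the repository's selection.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)

blocks_to_remove = {24, 25, 26, 27, 28, 29}  # placeholder: ~20% of the 32 blocks

# Keep only the surviving blocks and update the config accordingly.
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in blocks_to_remove
)
model.config.num_hidden_layers = len(model.model.layers)

model.save_pretrained("st-vicuna-pruned")  # then retrain (step 2) and evaluate (step 3)
```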

Other Scripts

Gradio Demo: Width✄ vs. Depth✄

The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:

```bash
pip install transformers==4.33.1  # to run LLM-Pruner's model
python src/app.py
```
(Demo screenshot captured on an A100 80GB GPU.)

License

Acknowledgments

Citation

```bibtex
@article{kim2024shortened,
  title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}
```