LLMC is an off-the-shelf tool designed for compressing LLMs, leveraging state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.
English doc is here.
Chinese doc is here.
Docker hub is here.
Aliyun docker: registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]
You can download the Docker image that can run llmc with the following commands. Users in mainland China are recommended to use the Alibaba Cloud mirror.
docker hub:
docker pull llmcompression/llmc:pure-latest

aliyun docker:
docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-latest
Community:
Nov 20, 2024: 🔥 We now fully support the quantization of ✨DeepSeekv2(2.5) and other MOE models, as well as ✨Qwen2VL, Llama3.2, and other VLM models. Supported quantization methods include ✅integer quantization, ✅floating-point quantization, and advanced algorithms like ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot.
Nov 12, 2024: 🔥 We have added support for 💥static per-tensor activation quantization across various models and algorithms, covering ✅integer quantization and ✅floating-point quantization to further optimize performance and efficiency. Additionally, we now support exporting ✨real quantized models and using the VLLM and SGLang backends for inference acceleration. For more details, refer to the VLLM documentation and SGLang documentation.
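To make the idea concrete: static per-tensor activation quantization precomputes a single scale for the whole tensor from offline calibration data and reuses it at inference time. The following is a minimal NumPy sketch of that scheme — an illustration of the technique, not LLMC's internal API:

```python
import numpy as np

def calibrate_scale(calib, bits=8):
    # One scale for the entire tensor, chosen offline ("static") from calibration data.
    qmax = 2 ** (bits - 1) - 1
    return np.abs(calib).max() / qmax

def quantize_per_tensor(x, scale, bits=8):
    # Symmetric signed-integer quantization using the precomputed scale.
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

rng = np.random.default_rng(0)
calib = rng.standard_normal(4096).astype(np.float32)  # offline calibration pass
scale = calibrate_scale(calib)

x = rng.standard_normal(16).astype(np.float32)        # runtime activations
q = quantize_per_tensor(x, scale)
dequant = q.astype(np.float32) * scale                # reconstruction for reference
```

Because the scale is fixed ahead of time, no per-batch statistics are needed at inference, which is what makes the static variant deployment-friendly.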
Sep 26, 2024: 🔥 We now support exporting 💥FP8-quantized (E4M3, E5M2) models from 🚀LLMC to advanced inference backends such as VLLM and SGLang. For detailed usage, please refer to the VLLM documentation and SGLang documentation.
Sep 24, 2024: 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨Llama-3.1-405B, quantized using 🚀LLMC in save_lightllm mode. You can download the model parameters here.
Sep 23, 2024: 🔥 We now support exporting ✨real quantized (INT4, INT8) models from 🚀LLMC to advanced inference backends such as VLLM, SGLang, AutoAWQ, and MLC-LLM for quantized inference deployment, enabling ✨reduced memory usage and ✨faster inference speeds. For detailed usage, please refer to the VLLM documentation, SGLang documentation, AutoAWQ documentation, and MLC-LLM documentation.
Sep 9, 2024: 🔥 We provide configs reflecting our best practices for superior performance (see Best Practice here).
Sep 3, 2024: 🔥 We support opencompass 🤗 to evaluate 🚀LLMC models. Follow this doc and give it a try!
Aug 22, 2024: 🔥 We support many small language models, including the current SOTA SmolLM (see Supported Model List).
Aug 22, 2024: 🔥 Additionally, we support downstream task evaluation through our modified lm-evaluation-harness 🤗. Specifically, you can first use save_trans mode (see the save part in Configuration) to save a weight-modified model. After obtaining the transformed model, you can directly evaluate the quantized model by referring to run_lm_eval.sh. More details can be found here.
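For orientation, a save_trans setup in the config might look like the fragment below. The exact keys and layout depend on your llmc version, so treat this as an illustrative sketch and check the Configuration documentation for the authoritative schema:

```yaml
save:
    save_trans: True              # export the weight-modified (transformed) model
    save_path: ./save_trans_model # where the transformed checkpoint is written
```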
Jul 23, 2024: 🍺🍺🍺 We release a brand-new benchmark paper:
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit.
Ruihao Gong*, Yang Yong*, Shiqiao Gu*, Yushi Huang*, Chengtao Lv, Yunchen Zhang, Xianglong Liu📧, Dacheng Tao
(* denotes equal contribution, 📧 denotes corresponding author.)
💥Comprehensive Algorithm Support: Provides a broad range of ✨SOTA compression algorithms, including ✅quantization, ✅mixed-precision quantization, and ✅sparsity, while maintaining accuracy consistent with the original repositories. ✨Quantization best practices (see 🚀Best Practices here) are also available to ensure optimal performance and efficiency.
💥Supported Formats: Supports both ✨quantization (integer and floating-point) and ✨sparsity, specifically including ✅weight-activation, ✅weight-only, and ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.
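As a concrete illustration of the weight-only format, a group-wise symmetric INT4 scheme — one scale per small group of weights — can be sketched in a few lines of NumPy. This is a toy example of the general technique, not LLMC's implementation:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=8):
    # Weight-only symmetric quantization: one scale per group of weights.
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)      # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return q, scales

def dequantize_groupwise(q, scales, shape):
    # Recover an approximate float weight matrix from codes and scales.
    return (q * scales).reshape(shape)

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 16)).astype(np.float32)
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales, w.shape)
```

Smaller groups give tighter scales (lower error) at the cost of more stored metadata — the same trade-off the real weight-only formats navigate.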
💥Wide Model Support: Offers support for a diverse array of ✨LLM models, including ✅LLaMA, ✅Mistral, ✅InternLM2, and ✅Qwen2, among others, as well as ✅MOE (DeepSeekv2, DeepSeekv2.5) and ✅VLM (Llama3.2-vision, Qwen2-VL) models (see Supported Model List).
💥Multi-backend Compatibility: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅SGLang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile (see the Backend section here).
💥Performance Efficiency: Enables quantization of large LLMs, such as ✨Llama3.1-405B and ✨DeepSeekV2-236B, with PPL evaluation on a single A100/H100/H800 GPU.
Please refer to the 🚀Quick Start section in the documentation.
✅ BLOOM
✅ LLaMA
✅ LLaMA V2
✅ OPT
✅ Falcon
✅ Mistral
✅ LLaMA V3
✅ Mixtral
✅ Qwen V2
✅ LLaVA
✅ StableLM
✅ Gemma2
✅ Phi2
✅ Phi 1.5
✅ MiniCPM
✅ SmolLM
✅ Qwen MOE
✅ Qwen2-VL
You can add your own model type by referring to the files under llmc/models/*.py.
✅ VLLM
✅ LightLLM
✅ Sglang
✅ MLC-LLM
✅ AutoAWQ
✅ Naive
✅ AWQ
✅ GPTQ
✅ OS+
✅ AdaDim
✅ QUIK
✅ SpQR
✅ DGQ
✅ OWQ
✅ HQQ
✅ QuaRot
✅ TesseraQ
✅ Naive(Magnitude)
✅ Wanda
✅ ShortGPT
We developed our code with reference to the following repos:
If you find our LLM-QBench paper or the llmc toolkit useful or relevant to your research, please cite our paper:
@misc{llmc,
author = {llmc contributors},
title = {llmc: Towards Accurate and Efficient LLM Compression},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ModelTC/llmc}},
}
@misc{gong2024llmqbench,
title={LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models},
author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
year={2024},
eprint={2405.06001},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{gong2024llmcbenchmarkinglargelanguage,
title={LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit},
author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Chengtao Lv and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
year={2024},
eprint={2405.06001},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2405.06001},
}