[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
https://arxiv.org/abs/2401.18079

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Paper]


KVQuant is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization, thereby enabling efficient long context length inference.

TLDR: KVQuant addresses the memory bottleneck of long context length inference by quantizing the KV cache to low precision. It achieves high accuracy with low-precision KV cache quantization by considering several consistent patterns observed in cached KV values across different LLMs, and by developing methods to exploit these patterns, including:

- Per-Channel Key Quantization: quantizing Keys along the channel dimension to better match their distribution
- Pre-RoPE Key Quantization: quantizing Keys before the rotary positional embedding is applied, to mitigate its impact on quantization
- Non-Uniform KV Cache Quantization: per-layer, sensitivity-weighted non-uniform datatypes that better represent the observed distributions
- Per-Vector Dense-and-Sparse Quantization: isolating outliers separately for each vector to avoid skewed quantization ranges

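Below is a minimal NumPy sketch of two of these ideas: quantizing Keys per channel and Values per token, with a small fraction of outliers kept in full precision (dense-and-sparse). The shapes, the 3-bit uniform quantizer, the 1% outlier fraction, and the single global outlier threshold are simplifying assumptions for illustration; the actual method uses non-uniform, sensitivity-weighted datatypes and per-vector outlier isolation, as described in the paper.

```python
# Minimal sketch (assumed shapes, 3-bit uniform quantizer, 1% global outliers);
# illustrates per-channel Key / per-token Value quantization with
# dense-and-sparse outlier handling. Not the KVQuant implementation.
import numpy as np

def quantize_dequant(x, axis, bits=3, outlier_frac=0.01):
    """Uniformly quantize/dequantize x along `axis`, keeping the
    largest-magnitude entries (global threshold here) in full precision."""
    x = x.astype(np.float32)
    # Isolate outliers so they do not stretch the quantization range.
    thresh = np.quantile(np.abs(x), 1.0 - outlier_frac)
    outlier_mask = np.abs(x) > thresh
    dense = np.where(outlier_mask, 0.0, x)

    # Per-channel (reduce over tokens) or per-token (reduce over channels) range.
    mn = dense.min(axis=axis, keepdims=True)
    mx = dense.max(axis=axis, keepdims=True)
    scale = (mx - mn) / (2**bits - 1) + 1e-8
    q = np.round((dense - mn) / scale)
    dq = q * scale + mn

    # Add the sparse outliers back in full precision.
    return np.where(outlier_mask, x, dq)

# Toy KV cache slices: (num_tokens, head_dim)
rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64)) * np.exp(rng.normal(size=(1, 64)))  # outlier channels
values = rng.normal(size=(128, 64))

keys_hat = quantize_dequant(keys, axis=0)      # per-channel: one range per channel
values_hat = quantize_dequant(values, axis=1)  # per-token: one range per token

print("key error:  ", np.abs(keys - keys_hat).mean())
print("value error:", np.abs(values - values_hat).mean())
```
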
KVQuant enables serving LLaMA-7B with a 1M context length on a single A100-80GB GPU, or even with a 10M context length on an 8-GPU system 🔥
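
For intuition on why quantization helps here, a rough back-of-envelope estimate of KV cache size for a LLaMA-7B-style configuration (32 layers, hidden size 4096) is sketched below; it ignores model weights and the small sparse-outlier overhead, so it illustrates the scale rather than reproducing the paper's measurements.

```python
# Back-of-envelope KV cache size for a LLaMA-7B-style model
# (32 layers, hidden size 4096). Ignores model weights and the small
# dense-and-sparse outlier overhead; numbers are illustrative only.
def kv_cache_gib(context_len, bits_per_element, n_layers=32, hidden=4096):
    # 2 tensors (K and V) per layer, one vector of size `hidden` per token.
    bytes_total = 2 * n_layers * hidden * context_len * bits_per_element / 8
    return bytes_total / 2**30

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>10,} tokens: fp16 ~ {kv_cache_gib(ctx, 16):7.1f} GiB, "
          f"2-bit ~ {kv_cache_gib(ctx, 2):6.1f} GiB")
```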

[TLDR: Twitter Thread] [Paper]


Long Context Length Inference with Large World Model

Large World Model (LWM) is a recent work that enables training long context length models with up to 1M context length. However, running inference with these models is extremely resource-intensive due to the large KV cache that must be kept in memory throughout generation. Using KVQuant, we can now run inference with these long context length models efficiently on a single A100!

The lwm/ directory contains scripts for running inference and evaluation using the quantized Large World Models.


Additional Method Improvements

To further support long context length inference, we have made several improvements to our methodology:


Installation

The codebase contains five subfolders, each with its own README containing instructions for installing the required environment for that step.


How the code is structured

To reproduce the perplexity numbers reported in the paper, first run the gradients step and then the quant step.
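
At a high level, the gradients step gathers per-element sensitivities (squared gradients, i.e. Fisher information, on calibration data), and the quant step uses them to derive the non-uniform quantizer and evaluate perplexity. The sketch below illustrates the underlying idea as a sensitivity-weighted 1-D k-means over activation values; the function names, initialization, and toy data are assumptions for illustration and are not the repository's actual scripts.

```python
# Conceptual sketch of sensitivity-weighted non-uniform quantization:
# fit 2**bits "signposts" to activation values with a weighted 1-D k-means,
# where weights are squared gradients (Fisher info) from calibration data.
# Illustrative only; not the repository's scripts.
import numpy as np

def fit_signposts(x, fisher, bits=3, iters=25):
    x = x.ravel().astype(np.float64)
    w = fisher.ravel().astype(np.float64) + 1e-12
    k = 2**bits
    # Initialize signposts at sensitivity-weighted quantiles of the data.
    order = np.argsort(x)
    cdf = np.cumsum(w[order]) / w.sum()
    centers = x[order][np.searchsorted(cdf, (np.arange(k) + 0.5) / k)]
    for _ in range(iters):
        # Assign each value to its nearest signpost, then recompute each
        # signpost as the sensitivity-weighted mean of its members.
        assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centers[j] = np.average(x[m], weights=w[m])
    return np.sort(centers)

def quantize_with_signposts(x, centers):
    idx = np.abs(x[..., None] - centers).argmin(axis=-1)  # code indices
    return centers[idx]                                   # dequantized values

rng = np.random.default_rng(0)
acts = rng.standard_normal(4096)
fisher = rng.random(4096)  # stand-in for squared gradients from calibration
centers = fit_signposts(acts, fisher, bits=3)
print("signposts:", np.round(centers, 3))
print("mean |err|:", np.abs(acts - quantize_with_signposts(acts, centers)).mean())
```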


Roadmap:


Acknowledgement

This code reuses components from several libraries including GPTQ, GPTQ-For-LLaMA, and SqueezeLLM.


Citation

KVQuant was developed as part of the paper below. If you find this library useful for your work, please cite:

@article{hooper2024kvquant,
  title={KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization},
  author={Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Mahoney, Michael W and Shao, Yakun Sophia and Keutzer, Kurt and Gholami, Amir},
  journal={arXiv preprint arXiv:2401.18079},
  year={2024}
}