📒Introduction
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes. This repository is maintained for personal use: learning about and classifying the hottest KV-Cache-related papers!
©️Citations
📖Contents
📖Trending Inference Topics (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models (@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (@Tri Dao et al.) | [pdf] | [flash-attention] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️⭐️ | |
LLM KV Cache Compression (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2023.06 | 🔥🔥[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [pdf] | [H2O] | ⭐️⭐️⭐️ | Attention-based selection (see the sketch below this table) |
| 2023.09 | 🔥🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [pdf] | [streaming-llm] | ⭐️⭐️⭐️ | Retains the first few tokens (attention sinks) |
| 2023.10 | 🔥[FastGen] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [pdf] | | ⭐️⭐️ | Head-specific compression strategies |
| 2023.10 | 🔥🔥[CacheGen] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Compresses the KV cache into bitstreams for storage and sharing |
| 2024.04 | 🔥🔥[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation | [pdf] | [SnapKV] | ⭐️⭐️⭐️ | Attention pooling before selection |
| 2024.05 | [Scissorhands] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [pdf] | | ⭐️ | |
| 2024.06 | 🔥A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | [pdf] | | ⭐️ | The L2 norm is a better importance metric than attention scores |
| 2024.06 | CORM: Cache Optimization with Recent Message for Large Language Model Inference | [pdf] | | ⭐️ | |
| 2024.07 | Efficient Sparse Attention needs Adaptive Token Release | [pdf] | | ⭐️ | |
| 2024.03 | [ALISA] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | [pdf] | | ⭐️ | |
| 2024.03 | 🔥🔥🔥[FastV] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | [pdf] | [EasyKV] | ⭐️⭐️⭐️ | |
| 2024.03 | [Keyformer] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | [pdf] | [keyformer-llm] | ⭐️⭐️ | |
| 2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.06 | 🔥Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | [pdf] | | ⭐️ | |
| 2024.06 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | [pdf] | [EasyKV] | ⭐️ | |
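
As a rough illustration of the attention-score-based selection used by several entries above (H2O, SnapKV, and similar), here is a minimal, hedged sketch of evicting low-importance tokens from a single head's KV cache. The function name, the sink-token handling, and the scoring rule are illustrative assumptions, not the exact algorithm of any listed paper.

```python
# Illustrative only: single-head KV cache eviction driven by accumulated
# attention scores (in the spirit of H2O / SnapKV token selection).
# All names below are hypothetical, not taken from any listed repo.
import torch

def evict_kv_by_attention(keys, values, attn_scores, budget, num_sink=4):
    """keys/values: [seq_len, head_dim]; attn_scores: [num_queries, seq_len].

    Keeps the first `num_sink` tokens (attention sinks, cf. StreamingLLM) plus
    the tokens with the highest accumulated attention mass, `budget` in total.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Accumulated attention each cached token received from recent queries.
    importance = attn_scores.sum(dim=0)            # [seq_len]
    importance[:num_sink] = float("inf")           # always keep sink tokens

    keep = torch.topk(importance, k=budget).indices.sort().values
    return keys[keep], values[keep]

# Toy usage with random tensors.
L, D, Q, budget = 128, 64, 16, 32
keys, values = torch.randn(L, D), torch.randn(L, D)
attn = torch.softmax(torch.randn(Q, L), dim=-1)
k_small, v_small = evict_kv_by_attention(keys, values, attn, budget)
print(k_small.shape)  # torch.Size([32, 64])
```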
KV Cache Merge (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2023.10 | 🔥🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Selective update when merging KV caches |
| 2023.12 | 🔥Compressed Context Memory For Online Language Model Interaction | [pdf] | [ContextMemory] | ⭐️⭐️⭐️ | Fine-tunes LLMs to recurrently compress KV caches |
| 2024.01 | [CaM] CaM: Cache Merging for Memory-efficient LLMs Inference | [pdf] | [cam] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥You Only Cache Once: Decoder-Decoder Architectures for Language Models | [pdf] | [unilm] | ⭐️⭐️ | |
| 2024.06 | 🔥🔥[D2O] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥[KVMerger] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | [pdf] | | ⭐️⭐️⭐️ | |
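
To make the merging idea concrete (roughly in the spirit of CaM and KVMerger above), below is a minimal, hedged sketch that folds each evicted token's K/V into its most similar retained token via a running average. The cosine-similarity rule, equal weighting, and all names are illustrative assumptions rather than the method of any specific paper.

```python
# Illustrative only: merge evicted KV entries into their most similar kept
# entries instead of discarding them. The similarity rule and weighting are
# simplifying assumptions, not CaM's or KVMerger's exact procedure.
import torch

def merge_evicted_kv(keys, values, keep_idx):
    """keys/values: [seq_len, head_dim]; keep_idx: 1-D LongTensor of kept tokens."""
    seq_len = keys.shape[0]
    keep_mask = torch.zeros(seq_len, dtype=torch.bool)
    keep_mask[keep_idx] = True
    evict_idx = torch.nonzero(~keep_mask).squeeze(-1)

    merged_k, merged_v = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(len(keep_idx))  # how many tokens each kept slot aggregates

    for i in evict_idx.tolist():
        # Find the kept key most similar (cosine) to the evicted key.
        sims = torch.nn.functional.cosine_similarity(keys[i].unsqueeze(0), merged_k, dim=-1)
        j = int(sims.argmax())
        # Running average so every merged token contributes equally.
        merged_k[j] = (merged_k[j] * counts[j] + keys[i]) / (counts[j] + 1)
        merged_v[j] = (merged_v[j] * counts[j] + values[i]) / (counts[j] + 1)
        counts[j] += 1
    return merged_k, merged_v

# Toy usage: keep every other token, merge the rest into their nearest neighbors.
keys, values = torch.randn(64, 32), torch.randn(64, 32)
mk, mv = merge_evicted_kv(keys, values, torch.arange(0, 64, 2))
print(mk.shape)  # torch.Size([32, 32])
```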
Budget Allocation (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥[PyramidInfer] PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | [pdf] | [PyramidInfer] | ⭐️⭐️⭐️ | Layer-wise budget allocation (see the sketch below this table) |
| 2024.06 | 🔥[PyramidKV] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | [pdf] | [PyramidKV] | ⭐️⭐️⭐️ | Layer-wise budget allocation |
| 2024.07 | 🔥[Ada-KV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | [pdf] | | ⭐️⭐️⭐️ | Head-wise budget allocation |
| 2024.07 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | [pdf] | | ⭐️ | |
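
To give "layer-wise budget allocation" a concrete shape, here is a minimal, hedged sketch of a pyramid-style schedule that assigns lower layers a larger share of the total KV budget than higher layers. The linear schedule, the `ratio` parameter, and the function name are illustrative assumptions, not the exact allocation rules of PyramidInfer, PyramidKV, or Ada-KV.

```python
# Illustrative only: a pyramid-shaped per-layer KV cache budget, decreasing
# linearly from the bottom layer to the top layer. The schedule and names are
# simplifying assumptions, not PyramidInfer / PyramidKV / Ada-KV's formulas.
def pyramid_layer_budgets(total_budget, num_layers, ratio=4.0):
    """Split `total_budget` cached tokens across layers so the first layer gets
    roughly `ratio` times more tokens than the last layer."""
    if num_layers == 1:
        return [total_budget]
    # Unnormalized weights running linearly from `ratio` down to 1.
    weights = [ratio - (ratio - 1.0) * l / (num_layers - 1) for l in range(num_layers)]
    scale = total_budget / sum(weights)
    # Rounding may make the sum drift from `total_budget` by a few tokens.
    return [max(1, round(w * scale)) for w in weights]

print(pyramid_layer_budgets(total_budget=4096, num_layers=8))
# [819, 731, 644, 556, 468, 380, 293, 205]
```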
Cross-Layer KV Cache Utilization (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | [pdf] | | ⭐️ | |
| 2024.05 | 🔥Layer-Condensed KV Cache for Efficient Inference of Large Language Models | [pdf] | [LCKV] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | [pdf] | [pythia-mlkv] | ⭐️⭐️ | |
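
The papers above shrink memory along the depth dimension by letting several adjacent layers read the same K/V tensors instead of keeping one cache per layer. Below is a minimal, hedged sketch of that bookkeeping; the fixed grouping of consecutive layers and all names are illustrative assumptions, not the exact designs of CLA, LCKV, MiniCache, or MLKV.

```python
# Illustrative only: one shared KV cache per group of adjacent layers, so a
# model with `num_layers` layers stores roughly num_layers / group_size caches.
# Grouping consecutive layers is a simplifying assumption.
class SharedKVCache:
    def __init__(self, num_layers, group_size=2):
        self.group_size = group_size
        num_groups = (num_layers + group_size - 1) // group_size
        self.k = [[] for _ in range(num_groups)]
        self.v = [[] for _ in range(num_groups)]

    def _group(self, layer_idx):
        return layer_idx // self.group_size

    def append(self, layer_idx, k_t, v_t):
        # Only the first layer of each group writes new K/V; the rest reuse it.
        if layer_idx % self.group_size == 0:
            g = self._group(layer_idx)
            self.k[g].append(k_t)
            self.v[g].append(v_t)

    def get(self, layer_idx):
        g = self._group(layer_idx)
        return self.k[g], self.v[g]

# Toy usage: 8 layers sharing 4 caches (2 layers per cache).
cache = SharedKVCache(num_layers=8, group_size=2)
cache.append(0, "k0", "v0")   # layer 0 writes
cache.append(1, "k1", "v1")   # layer 1 is skipped: it reuses layer 0's cache
print(cache.get(1))           # (['k0'], ['v0'])
```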
KV Cache Quantization (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.03 | 🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | [pdf] | [GEAR] | ⭐️⭐️ | |
| 2024.01 | 🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | [pdf] | [KVQuant] | ⭐️⭐️ | Quantizes the entire KV cache |
| 2024.02 | [No Token Left Behind] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.02 | [KIVI] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | [pdf] | [KIVI] | ⭐️⭐️ | |
| 2024.02 | [WKVQuant] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | [pdf] | | | |
| 2024.03 | [QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache | [pdf] | [QAQ-KVCacheQuantization] | ⭐️ | Attention-based KV cache quantization |
| 2024.05 | [ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | [pdf] | | ⭐️ | |
| 2024.05 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | [pdf] | | ⭐️ | |
| 2024.05 | [SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | [pdf] | [SKVQ] | ⭐️ | |
| 2024.07 | [PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference | [pdf] | | ⭐️ | |
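
As a generic reference point for the quantization entries above, the hedged sketch below applies per-token asymmetric round-to-nearest quantization to a KV tensor (one scale and zero point per cached token). It is a simplified baseline under my own naming, not the exact scheme of KIVI, KVQuant, SKVQ, or any other listed paper.

```python
# Illustrative only: per-token asymmetric round-to-nearest quantization of a
# KV tensor. For simplicity the low-bit values stay unpacked in uint8; real
# systems (KIVI, KVQuant, ...) use more careful channel/token-wise schemes.
import torch

def quantize_kv(x, bits=4):
    """x: [seq_len, head_dim] -> (q, scale, zero_point), one scale/zp per token."""
    qmax = 2 ** bits - 1
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, x_min

def dequantize_kv(q, scale, zero_point):
    return q.float() * scale + zero_point

x = torch.randn(128, 64)
q, scale, zp = quantize_kv(x, bits=4)
print((dequantize_kv(q, scale, zp) - x).abs().mean())  # small reconstruction error
```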
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.07 | 🔥[Benchmark] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | [pdf] | | ⭐️ | |
Low Rank KV Cache Decomposition (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.02 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | [pdf] | [LESS] | ⭐️⭐️⭐️ | Fine-tunes the model to make the KV cache low-rank |
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | Trains a low-rank KV cache from scratch (see the sketch below this table) |
| 2024.06 | [Loki] Loki: Low-Rank Keys for Efficient Sparse Attention | [pdf] | | ⭐️ | |
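
To illustrate the low-rank theme shared by LESS, DeepSeek-V2, and Loki above, here is a minimal, hedged sketch that caches a rank-r projection of the keys and reconstructs approximate keys on demand. The offline SVD-fitted basis and all names are illustrative assumptions, not the actual mechanism of any listed paper.

```python
# Illustrative only: cache a rank-r projection of the keys instead of the full
# keys, reconstructing approximate keys when attention needs them. The offline
# SVD basis is an illustrative choice, not DeepSeek-V2's MLA or Loki's method.
import torch

def fit_key_basis(key_samples, rank):
    """key_samples: [n, head_dim] -> orthonormal basis [head_dim, rank]."""
    # Top-`rank` right singular vectors = principal directions of typical keys.
    _, _, vh = torch.linalg.svd(key_samples, full_matrices=False)
    return vh[:rank].T

basis = fit_key_basis(torch.randn(1024, 64), rank=16)   # fitted offline

k_new = torch.randn(1, 64)                # a freshly computed key
k_cached = k_new @ basis                  # store only [1, 16]: 4x smaller
k_approx = k_cached @ basis.T             # approximate key at attention time
print(k_approx.shape)                     # torch.Size([1, 64])
```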
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2022.09 | In-context Learning and Induction Heads | [pdf] | | ⭐️⭐️ | |
| 2024.01 | 🔥Transformers are Multi-State RNNs | [pdf] | [TOVA] | ⭐️⭐️ | |
| 2024.04 | 🔥[Retrieval Head] Retrieval Head Mechanistically Explains Long-Context Factuality | [pdf] | [Retrieval_Head] | ⭐️⭐️⭐️ | |
| 2024.04 | 🔥[Massive Activations] Massive Activations in Large Language Models | [pdf] | [Massive Activation] | ⭐️⭐️⭐️ | |
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.02 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | [pdf] | | ⭐️ | |
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.07 | 🔥🔥Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | [pdf] | [GeneralAI] | ⭐️⭐️⭐️ | |
©️License
GNU General Public License v3.0
🎉Contribute
Stars and PRs to this repo are welcome!
```bibtex
@misc{Awesome-LLM-KV-Cache@2024,
  title={Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with codes},
  url={https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  note={Open-source software available at https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  author={Zefan Cai and others},
  year={2024}
}
```