Zefan-Cai / Awesome-LLM-KV-Cache


📒Introduction

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes. This repository is maintained for personal learning and for organizing the rapidly growing body of KV-cache-related papers.

©️Citations

📖Contents

📖Trending Inference Topics (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models (@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (@Tri Dao et al.) | [pdf] | [flash-attention] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️⭐️ | |

LLM KV Cache Compression (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2023.06 | 🔥🔥[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [pdf] | [H2O] | ⭐️⭐️⭐️ | Attention-based selection (see the sketch below the table) |
| 2023.09 | 🔥🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [pdf] | [streaming-llm] | ⭐️⭐️⭐️ | Retain the first few tokens |
| 2023.10 | 🔥[FastGen] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [pdf] | | ⭐️⭐️ | Head-specific compression strategies |
| 2023.10 | 🔥🔥[CacheGen] KV Cache Compression and Streaming for Fast Large Language Model Serving | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Compresses the KV cache into bitstreams for storage and sharing |
| 2024.04 | 🔥🔥[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation | [pdf] | [SnapKV] | ⭐️⭐️⭐️ | Attention pooling before selection |
| 2024.05 | [Scissorhands] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [pdf] | | ⭐️ | |
| 2024.06 | 🔥 A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | [pdf] | | ⭐️ | L2 norm outperforms attention scores as an importance metric |
| 2024.06 | CORM: Cache Optimization with Recent Message for Large Language Model Inference | [pdf] | | ⭐️ | |
| 2024.07 | Efficient Sparse Attention needs Adaptive Token Release | [pdf] | | ⭐️ | |
| 2024.03 | [ALISA] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | [pdf] | | ⭐️ | |
| 2024.03 | 🔥🔥🔥[FastV] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | [pdf] | [EasyKV] | ⭐️⭐️⭐️ | |
| 2024.03 | [Keyformer] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | [pdf] | [keyformer-llm] | ⭐️⭐️ | |
| 2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.06 | 🔥 Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | [pdf] | | ⭐️ | |
| 2024.06 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | [pdf] | [EasyKV] | ⭐️ | |
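
Several of the eviction methods above (H2O, Scissorhands, SnapKV, Keyformer) share a common pattern: score each cached token by the attention it receives from recent queries, always keep a small local window, and drop everything outside a fixed budget. The snippet below is a minimal, method-agnostic sketch of that pattern in PyTorch; the tensor shapes, the summed-attention score, and the always-kept window are illustrative assumptions, not the exact algorithm of any single paper.

```python
import torch

def evict_kv_by_attention(keys, values, attn_weights, budget, window=8):
    """Toy attention-score-based KV eviction (H2O/SnapKV-style pattern).

    keys, values : [num_heads, seq_len, head_dim] cached K/V for one layer
    attn_weights : [num_heads, num_recent_queries, seq_len] attention of the
                   most recent queries over the cached tokens
    budget       : number of cached tokens to keep per head
    window       : most recent tokens that are always kept (local context)
    """
    num_heads, seq_len, _ = keys.shape
    # Importance of each cached token = attention mass it received
    # from the recent queries (summed over those queries).
    scores = attn_weights.sum(dim=1)                    # [num_heads, seq_len]
    # Always keep the most recent `window` tokens.
    scores[:, -window:] = float("inf")
    keep = min(budget, seq_len)
    idx = scores.topk(keep, dim=-1).indices.sort(dim=-1).values  # preserve order
    idx_k = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, idx_k), values.gather(1, idx_k)

if __name__ == "__main__":
    H, S, D, Q = 4, 128, 64, 16
    k, v = torch.randn(H, S, D), torch.randn(H, S, D)
    attn = torch.softmax(torch.randn(H, Q, S), dim=-1)
    k2, v2 = evict_kv_by_attention(k, v, attn, budget=32)
    print(k2.shape, v2.shape)   # torch.Size([4, 32, 64]) twice
```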

KV Cache Merge (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2023.10 | 🔥🔥[CacheBlend] Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Selective update when merging KV caches |
| 2023.12 | 🔥 Compressed Context Memory For Online Language Model Interaction | [pdf] | [ContextMemory] | ⭐️⭐️⭐️ | Finetuning LLMs to recurrently compress KV caches |
| 2024.01 | [CaM] CaM: Cache Merging for Memory-efficient LLMs Inference | [pdf] | [cam] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥 You Only Cache Once: Decoder-Decoder Architectures for Language Models | [pdf] | [unilm] | ⭐️⭐️ | |
| 2024.06 | 🔥🔥[D2O] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥[KVMerger] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | [pdf] | | ⭐️⭐️⭐️ | |
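
The merging papers in this table (CaM, D2O, KVMerger) keep memory bounded without discarding information outright: states that would otherwise be evicted are folded into retained cache entries. Below is a minimal sketch of that general idea, where each evicted key/value is averaged into its most similar retained entry by cosine similarity; the similarity measure and the equal weighting are assumptions made for illustration, not the specific merging rule of any paper listed here.

```python
import torch
import torch.nn.functional as F

def merge_evicted_into_kept(keys, values, keep_idx):
    """Toy KV merging: fold evicted tokens into their nearest kept token.

    keys, values : [seq_len, head_dim] cache for one head
    keep_idx     : 1-D LongTensor of token positions to retain
    Returns merged keys/values of shape [len(keep_idx), head_dim].
    """
    seq_len = keys.shape[0]
    mask = torch.ones(seq_len, dtype=torch.bool)
    mask[keep_idx] = False
    evict_idx = mask.nonzero(as_tuple=True)[0]

    kept_k, kept_v = keys[keep_idx].clone(), values[keep_idx].clone()
    if evict_idx.numel() == 0:
        return kept_k, kept_v

    # Assign each evicted key to its most similar kept key (cosine similarity).
    sim = F.cosine_similarity(keys[evict_idx].unsqueeze(1), kept_k.unsqueeze(0), dim=-1)
    target = sim.argmax(dim=1)                      # [num_evicted]

    # Average evicted states into their targets (equal weights, for illustration).
    counts = torch.ones(len(keep_idx))
    for e, t in zip(evict_idx.tolist(), target.tolist()):
        kept_k[t] += keys[e]
        kept_v[t] += values[e]
        counts[t] += 1.0
    return kept_k / counts.unsqueeze(-1), kept_v / counts.unsqueeze(-1)

if __name__ == "__main__":
    k, v = torch.randn(64, 32), torch.randn(64, 32)
    keep = torch.arange(0, 64, 4)        # keep every 4th token
    mk, mv = merge_evicted_into_kept(k, v, keep)
    print(mk.shape, mv.shape)            # torch.Size([16, 32]) twice
```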

Budget Allocation (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.05 | 🔥[PyramidInfer] PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | [pdf] | [PyramidInfer] | ⭐️⭐️⭐️ | Layer-wise budget allocation (see the sketch below the table) |
| 2024.06 | 🔥[PyramidKV] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | [pdf] | [PyramidKV] | ⭐️⭐️⭐️ | Layer-wise budget allocation |
| 2024.07 | 🔥[Ada-KV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | [pdf] | | ⭐️⭐️⭐️ | Head-wise budget allocation |
| 2024.07 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | [pdf] | | ⭐️ | |
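
PyramidInfer and PyramidKV allocate the cache budget unevenly across layers (more cache for lower layers, less for upper layers where attention concentrates on fewer tokens), and Ada-KV does the analogous allocation across heads. As a rough illustration of the layer-wise idea only, the helper below turns a total token budget into a decreasing per-layer schedule; the linear decay is an assumption made for this sketch, not the allocation rule used by those papers.

```python
def pyramid_layer_budgets(total_budget: int, num_layers: int, min_budget: int = 8):
    """Split a total KV-cache token budget across layers in a decreasing
    (pyramid-like) schedule: lower layers get more, upper layers get less.

    Linear decay is used purely for illustration.
    """
    # Linearly decaying weights: layer 0 gets the largest share.
    weights = [num_layers - i for i in range(num_layers)]
    total_w = sum(weights)
    budgets = [max(min_budget, round(total_budget * w / total_w)) for w in weights]
    return budgets

if __name__ == "__main__":
    budgets = pyramid_layer_budgets(total_budget=2048, num_layers=32)
    print(budgets[:4], "...", budgets[-4:])   # larger budgets first, smaller last
    print(sum(budgets))                       # roughly the total budget
```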

Cross-Layer KV Cache Utilization (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.05 | 🔥 Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | [pdf] | | ⭐️ | |
| 2024.05 | 🔥 Layer-Condensed KV Cache for Efficient Inference of Large Language Models | [pdf] | [LCKV] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | [pdf] | [pythia-mlkv] | ⭐️⭐️ | |
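
The cross-layer methods above (Cross-Layer Attention, LCKV, MiniCache, MLKV) shrink the cache along the depth dimension by letting several adjacent layers reuse one layer's K/V instead of each layer storing its own. The sketch below shows only the cache bookkeeping for a fixed sharing factor, where the first layer of each group writes K/V and the rest of the group reads it; this illustrates the shared-buffer pattern, not the architecture of any specific paper.

```python
import torch

class SharedKVCache:
    """Toy cross-layer KV cache: layers in the same group share one K/V buffer."""

    def __init__(self, num_layers: int, share_factor: int = 2):
        self.share_factor = share_factor
        self.num_groups = (num_layers + share_factor - 1) // share_factor
        self.k = [None] * self.num_groups   # one buffer per group, not per layer
        self.v = [None] * self.num_groups

    def group(self, layer_idx: int) -> int:
        return layer_idx // self.share_factor

    def update(self, layer_idx: int, k_new: torch.Tensor, v_new: torch.Tensor):
        """Only the first layer of each group writes K/V; the rest reuse it."""
        g = self.group(layer_idx)
        if layer_idx % self.share_factor == 0:
            if self.k[g] is None:
                self.k[g], self.v[g] = k_new, v_new
            else:
                self.k[g] = torch.cat([self.k[g], k_new], dim=1)
                self.v[g] = torch.cat([self.v[g], v_new], dim=1)
        return self.k[g], self.v[g]   # every layer in the group reads the same cache

if __name__ == "__main__":
    cache = SharedKVCache(num_layers=4, share_factor=2)
    k = v = torch.randn(8, 1, 64)                 # [heads, new_tokens, head_dim]
    for layer in range(4):
        k_all, v_all = cache.update(layer, k, v)
        print(layer, cache.group(layer), k_all.shape)
```

With `share_factor=2`, the cache holds half as many K/V buffers as a vanilla per-layer cache.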

KV Cache Quantization (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.03 | 🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | [pdf] | [GEAR] | ⭐️⭐️ | |
| 2024.01 | 🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | [pdf] | [KVQuant] | ⭐️⭐️ | Quantizes the entire KV cache (see the sketch below the table) |
| 2024.02 | [No Token Left Behind] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.02 | [KIVI] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | [pdf] | [KIVI] | ⭐️⭐️ | |
| 2024.02 | [WKVQuant] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | [pdf] | | | |
| 2024.03 | [QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache | [pdf] | [QAQ-KVCacheQuantization] | ⭐️ | Attention-based KV cache quantization |
| 2024.05 | [ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | [pdf] | | ⭐️ | |
| 2024.05 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | [pdf] | | ⭐️ | |
| 2024.05 | [SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | [pdf] | [SKVQ] | ⭐️ | |
| 2024.07 | [PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference | [pdf] | | ⭐️ | |
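
Most entries in this table build on low-bit asymmetric integer quantization of K and V, differing mainly in how elements are grouped (e.g. per-channel for keys vs. per-token for values), how outliers are handled, and which parts stay in full precision. The snippet below is a generic round-to-nearest asymmetric quantizer with per-group scale and zero-point, shown as an illustrative baseline; it is not the exact scheme of any paper above.

```python
import torch

def quantize_asym(x: torch.Tensor, n_bits: int = 2, dim: int = -2):
    """Asymmetric round-to-nearest quantization with per-group scale/zero-point.

    x   : e.g. keys of shape [seq_len, head_dim]
    dim : dimension reduced when computing min/max (dim=-2 -> per-channel
          statistics over the sequence, a common choice for keys).
    """
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero_point = (-xmin / scale).round()
    q = torch.clamp((x / scale).round() + zero_point, 0, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize_asym(q, scale, zero_point):
    return (q.float() - zero_point) * scale

if __name__ == "__main__":
    keys = torch.randn(256, 128)                      # [seq_len, head_dim]
    q, s, z = quantize_asym(keys, n_bits=2, dim=-2)   # per-channel 2-bit
    err = (dequantize_asym(q, s, z) - keys).abs().mean()
    print(q.dtype, q.shape, f"mean abs error: {err:.3f}")
```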


Evaluation (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.07 | 🔥[Benchmark] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | [pdf] | | ⭐️ | |

Low Rank KV Cache Decomposition (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.02 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | [pdf] | [LESS] | ⭐️⭐️⭐️ | Fine-tunes the model so the KV cache becomes low-rank |
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | Trains a low-rank KV cache from scratch (see the sketch below the table) |
| 2024.06 | [Loki] Loki: Low-Rank Keys for Efficient Sparse Attention | [pdf] | | ⭐️ | |
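
DeepSeek-V2's Multi-head Latent Attention, and the low-rank structure exploited by LESS and Loki, rest on the same idea: store a much smaller latent vector per token and reconstruct K/V on demand. The sketch below shows only that storage/reconstruction trade-off with untrained projection layers; the dimensions and the omission of details such as RoPE handling are simplifying assumptions, not DeepSeek-V2's actual MLA design.

```python
import torch
import torch.nn as nn

class LowRankKVCache(nn.Module):
    """Toy low-rank KV cache: store a per-token latent, expand to K/V when needed."""

    def __init__(self, d_model=1024, d_latent=128, num_heads=8, head_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)                # compress
        self.up_k = nn.Linear(d_latent, num_heads * head_dim, bias=False)   # expand to K
        self.up_v = nn.Linear(d_latent, num_heads * head_dim, bias=False)   # expand to V
        self.num_heads, self.head_dim = num_heads, head_dim
        self.latents = None   # only this is cached: [seq_len, d_latent]

    def append(self, hidden: torch.Tensor):
        c = self.down(hidden)                          # [new_tokens, d_latent]
        self.latents = c if self.latents is None else torch.cat([self.latents, c], dim=0)

    def keys_values(self):
        """Reconstruct full K/V from the cached latents on demand."""
        k = self.up_k(self.latents).view(-1, self.num_heads, self.head_dim)
        v = self.up_v(self.latents).view(-1, self.num_heads, self.head_dim)
        return k, v

if __name__ == "__main__":
    cache = LowRankKVCache()
    cache.append(torch.randn(16, 1024))       # prefill 16 tokens
    cache.append(torch.randn(1, 1024))        # one decode step
    k, v = cache.keys_values()
    print(cache.latents.shape, k.shape)       # [17, 128] cached vs [17, 8, 64] reconstructed
```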

Observation (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2022.09 | In-context Learning and Induction Heads | [pdf] | | ⭐️⭐️ | |
| 2024.01 | 🔥 Transformers are Multi-State RNNs | [pdf] | [TOVA] | ⭐️⭐️ | |
| 2024.04 | 🔥[Retrieval Head] Retrieval Head Mechanistically Explains Long-Context Factuality | [pdf] | [Retrieval_Head] | ⭐️⭐️⭐️ | |
| 2024.04 | 🔥[Massive Activations] Massive Activations in Large Language Models | [pdf] | [Massive Activation] | ⭐️⭐️⭐️ | |

Systems (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.02 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | [pdf] | | ⭐️ | |

Others (©️back👆🏻)

| Date | Title | Paper | Code | Recom | Comment |
|:---:|:---|:---:|:---:|:---:|:---|
| 2024.02 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.07 | 🔥🔥Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | [pdf] | [GeneralAI] | ⭐️⭐️⭐️ | |

©️License

GNU General Public License v3.0

🎉Contribute

You are welcome to star this repo and to submit a PR!

```
@misc{Awesome-LLM-KV-Cache2024,
  title={Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with codes},
  url={https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  note={Open-source software available at https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  author={Cai, Zefan and others},
  year={2024}
}
```