📒Introduction
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes. This repository is maintained for personal use: learning about and classifying the hottest KV-Cache-related papers!
©️Citations
📖Contents
📖Trending Inference Topics (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models (@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (@Tri Dao et al.) | [pdf] | [flash-attention] | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️⭐️ | |
LLM KV Cache Compression (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2023.06 | 🔥🔥[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [pdf] | [H2O] | ⭐️⭐️⭐️ | Attention-based selection (see the sketch below this table) |
| 2023.09 | 🔥🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [pdf] | [streaming-llm] | ⭐️⭐️⭐️ | Retains the first few tokens (attention sinks) |
| 2023.10 | 🔥[FastGen] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [pdf] | | ⭐️⭐️ | Head-specific compression strategies |
| 2023.10 | 🔥🔥[CacheGen] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Compresses the KV cache into bitstreams for storage and sharing |
| 2024.04 | 🔥🔥[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation | [pdf] | [SnapKV] | ⭐️⭐️⭐️ | Attention pooling before selection |
| 2024.05 | [Scissorhands] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [pdf] | | ⭐️ | |
| 2024.06 | 🔥A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | [pdf] | | ⭐️ | The L2 norm is a better importance metric than attention scores |
| 2024.06 | CORM: Cache Optimization with Recent Message for Large Language Model Inference | [pdf] | | ⭐️ | |
| 2024.07 | Efficient Sparse Attention needs Adaptive Token Release | [pdf] | | ⭐️ | |
| 2024.03 | [ALISA] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | [pdf] | | ⭐️ | |
| 2024.03 | 🔥🔥🔥[FastV] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | [pdf] | [EasyKV] | ⭐️⭐️⭐️ | |
| 2024.03 | [Keyformer] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | [pdf] | [keyformer-llm] | ⭐️⭐️ | |
| 2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.06 | 🔥Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | [pdf] | | ⭐️ | |
| 2024.06 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | [pdf] | [EasyKV] | ⭐️ | |
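
As a rough illustration of the attention-score-based selection used by several entries above (H2O, SnapKV, and similar), here is a minimal, hedged sketch of evicting low-importance tokens from a single head's KV cache. The function name, the sink-token handling, and the scoring rule are illustrative assumptions, not the exact algorithm of any listed paper.

```python
# Illustrative only: single-head KV cache eviction driven by accumulated
# attention scores (in the spirit of H2O / SnapKV token selection).
# All names below are hypothetical, not taken from any listed repo.
import torch

def evict_kv_by_attention(keys, values, attn_scores, budget, num_sink=4):
    """keys/values: [seq_len, head_dim]; attn_scores: [num_queries, seq_len].

    Keeps the first `num_sink` tokens (attention sinks, cf. StreamingLLM) plus
    the tokens with the highest accumulated attention mass, `budget` in total.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Accumulated attention each cached token received from recent queries.
    importance = attn_scores.sum(dim=0)            # [seq_len]
    importance[:num_sink] = float("inf")           # always keep sink tokens

    keep = torch.topk(importance, k=budget).indices.sort().values
    return keys[keep], values[keep]

# Toy usage with random tensors.
L, D, Q, budget = 128, 64, 16, 32
keys, values = torch.randn(L, D), torch.randn(L, D)
attn = torch.softmax(torch.randn(Q, L), dim=-1)
k_small, v_small = evict_kv_by_attention(keys, values, attn, budget)
print(k_small.shape)  # torch.Size([32, 64])
```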
KV Cache Merge (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2023.10 | 🔥🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Selective update when merging KV caches |
| 2023.12 | 🔥Compressed Context Memory For Online Language Model Interaction | [pdf] | [ContextMemory] | ⭐️⭐️⭐️ | Fine-tunes LLMs to recurrently compress KV caches |
| 2024.01 | [CaM] CaM: Cache Merging for Memory-efficient LLMs Inference | [pdf] | [cam] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥You Only Cache Once: Decoder-Decoder Architectures for Language Models | [pdf] | [unilm] | ⭐️⭐️ | |
| 2024.06 | 🔥🔥[D2O] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.07 | 🔥[KVMerger] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | [pdf] | | ⭐️⭐️⭐️ | |
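
To make the merging idea concrete (roughly in the spirit of CaM and KVMerger above), below is a minimal, hedged sketch that folds each evicted token's K/V into its most similar retained token via a running average. The cosine-similarity rule, equal weighting, and all names are illustrative assumptions rather than the method of any specific paper.

```python
# Illustrative only: merge evicted KV entries into their most similar kept
# entries instead of discarding them. The similarity rule and weighting are
# simplifying assumptions, not CaM's or KVMerger's exact procedure.
import torch

def merge_evicted_kv(keys, values, keep_idx):
    """keys/values: [seq_len, head_dim]; keep_idx: 1-D LongTensor of kept tokens."""
    seq_len = keys.shape[0]
    keep_mask = torch.zeros(seq_len, dtype=torch.bool)
    keep_mask[keep_idx] = True
    evict_idx = torch.nonzero(~keep_mask).squeeze(-1)

    merged_k, merged_v = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(len(keep_idx))  # how many tokens each kept slot aggregates

    for i in evict_idx.tolist():
        # Find the kept key most similar (cosine) to the evicted key.
        sims = torch.nn.functional.cosine_similarity(keys[i].unsqueeze(0), merged_k, dim=-1)
        j = int(sims.argmax())
        # Running average so every merged token contributes equally.
        merged_k[j] = (merged_k[j] * counts[j] + keys[i]) / (counts[j] + 1)
        merged_v[j] = (merged_v[j] * counts[j] + values[i]) / (counts[j] + 1)
        counts[j] += 1
    return merged_k, merged_v

# Toy usage: keep every other token, merge the rest into their nearest neighbors.
keys, values = torch.randn(64, 32), torch.randn(64, 32)
mk, mv = merge_evicted_kv(keys, values, torch.arange(0, 64, 2))
print(mk.shape)  # torch.Size([32, 32])
```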
Budget Allocation (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥[PyramidInfer] PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | [pdf] | [PyramidInfer] | ⭐️⭐️⭐️ | Layer-wise budget allocation (see the sketch below this table) |
| 2024.06 | 🔥[PyramidKV] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | [pdf] | [PyramidKV] | ⭐️⭐️⭐️ | Layer-wise budget allocation |
| 2024.07 | 🔥[Ada-KV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | [pdf] | | ⭐️⭐️⭐️ | Head-wise budget allocation |
| 2024.07 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | [pdf] | | ⭐️ | |
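
To give "layer-wise budget allocation" a concrete shape, here is a minimal, hedged sketch of a pyramid-style schedule that assigns lower layers a larger share of the total KV budget than higher layers. The linear schedule, the `ratio` parameter, and the function name are illustrative assumptions, not the exact allocation rules of PyramidInfer, PyramidKV, or Ada-KV.

```python
# Illustrative only: a pyramid-shaped per-layer KV cache budget, decreasing
# linearly from the bottom layer to the top layer. The schedule and names are
# simplifying assumptions, not PyramidInfer / PyramidKV / Ada-KV's formulas.
def pyramid_layer_budgets(total_budget, num_layers, ratio=4.0):
    """Split `total_budget` cached tokens across layers so the first layer gets
    roughly `ratio` times more tokens than the last layer."""
    if num_layers == 1:
        return [total_budget]
    # Unnormalized weights running linearly from `ratio` down to 1.
    weights = [ratio - (ratio - 1.0) * l / (num_layers - 1) for l in range(num_layers)]
    scale = total_budget / sum(weights)
    # Rounding may make the sum drift from `total_budget` by a few tokens.
    return [max(1, round(w * scale)) for w in weights]

print(pyramid_layer_budgets(total_budget=4096, num_layers=8))
# [819, 731, 644, 556, 468, 380, 293, 205]
```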
Cross-Layer KV Cache Utilization (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.05 | 🔥Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | [pdf] | | ⭐️ | |
| 2024.05 | 🔥Layer-Condensed KV Cache for Efficient Inference of Large Language Models | [pdf] | [LCKV] | ⭐️⭐️ | |
| 2024.05 | 🔥🔥🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.06 | 🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | [pdf] | [pythia-mlkv] | ⭐️⭐️ | |
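
The papers above shrink memory along the depth dimension by letting several adjacent layers read the same K/V tensors instead of keeping one cache per layer. Below is a minimal, hedged sketch of that bookkeeping; the fixed grouping of consecutive layers and all names are illustrative assumptions, not the exact designs of CLA, LCKV, MiniCache, or MLKV.

```python
# Illustrative only: one shared KV cache per group of adjacent layers, so a
# model with `num_layers` layers stores roughly num_layers / group_size caches.
# Grouping consecutive layers is a simplifying assumption.
class SharedKVCache:
    def __init__(self, num_layers, group_size=2):
        self.group_size = group_size
        num_groups = (num_layers + group_size - 1) // group_size
        self.k = [[] for _ in range(num_groups)]
        self.v = [[] for _ in range(num_groups)]

    def _group(self, layer_idx):
        return layer_idx // self.group_size

    def append(self, layer_idx, k_t, v_t):
        # Only the first layer of each group writes new K/V; the rest reuse it.
        if layer_idx % self.group_size == 0:
            g = self._group(layer_idx)
            self.k[g].append(k_t)
            self.v[g].append(v_t)

    def get(self, layer_idx):
        g = self._group(layer_idx)
        return self.k[g], self.v[g]

# Toy usage: 8 layers sharing 4 caches (2 layers per cache).
cache = SharedKVCache(num_layers=8, group_size=2)
cache.append(0, "k0", "v0")   # layer 0 writes
cache.append(1, "k1", "v1")   # layer 1 is skipped: it reuses layer 0's cache
print(cache.get(1))           # (['k0'], ['v0'])
```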
KV Cache Quantization (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.03 | 🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | [pdf] | [GEAR] | ⭐️⭐️ | |
| 2024.01 | 🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | [pdf] | [KVQuant] | ⭐️⭐️ | Quantizes the entire KV cache |
| 2024.02 | [No Token Left Behind] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | [pdf] | | ⭐️⭐️⭐️ | |
| 2024.02 | [KIVI] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | [pdf] | [KIVI] | ⭐️⭐️ | |
| 2024.02 | [WKVQuant] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | [pdf] | | | |
| 2024.03 | [QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache | [pdf] | [QAQ-KVCacheQuantization] | ⭐️ | Attention-based KV cache quantization |
| 2024.05 | [ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | [pdf] | | ⭐️ | |
| 2024.05 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | [pdf] | | ⭐️ | |
| 2024.05 | [SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | [pdf] | [SKVQ] | ⭐️ | |
| 2024.07 | [PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference | [pdf] | | ⭐️ | |
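
As a generic reference point for the quantization entries above, the hedged sketch below applies per-token asymmetric round-to-nearest quantization to a KV tensor (one scale and zero point per cached token). It is a simplified baseline under my own naming, not the exact scheme of KIVI, KVQuant, SKVQ, or any other listed paper.

```python
# Illustrative only: per-token asymmetric round-to-nearest quantization of a
# KV tensor. For simplicity the low-bit values stay unpacked in uint8; real
# systems (KIVI, KVQuant, ...) use more careful channel/token-wise schemes.
import torch

def quantize_kv(x, bits=4):
    """x: [seq_len, head_dim] -> (q, scale, zero_point), one scale/zp per token."""
    qmax = 2 ** bits - 1
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, x_min

def dequantize_kv(q, scale, zero_point):
    return q.float() * scale + zero_point

x = torch.randn(128, 64)
q, scale, zp = quantize_kv(x, bits=4)
print((dequantize_kv(q, scale, zp) - x).abs().mean())  # small reconstruction error
```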
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.07 | 🔥[Benchmark] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | [pdf] | | ⭐️ | |
Low Rank KV Cache Decomposition (©️back👆🏻)
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.02 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | [pdf] | [LESS] | ⭐️⭐️⭐️ | Fine-tunes the model to make the KV cache low-rank |
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | Trains a low-rank KV cache from scratch (see the sketch below this table) |
| 2024.06 | [Loki] Loki: Low-Rank Keys for Efficient Sparse Attention | [pdf] | | ⭐️ | |
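
To illustrate the low-rank theme shared by LESS, DeepSeek-V2, and Loki above, here is a minimal, hedged sketch that caches a rank-r projection of the keys and reconstructs approximate keys on demand. The offline SVD-fitted basis and all names are illustrative assumptions, not the actual mechanism of any listed paper.

```python
# Illustrative only: cache a rank-r projection of the keys instead of the full
# keys, reconstructing approximate keys when attention needs them. The offline
# SVD basis is an illustrative choice, not DeepSeek-V2's MLA or Loki's method.
import torch

def fit_key_basis(key_samples, rank):
    """key_samples: [n, head_dim] -> orthonormal basis [head_dim, rank]."""
    # Top-`rank` right singular vectors = principal directions of typical keys.
    _, _, vh = torch.linalg.svd(key_samples, full_matrices=False)
    return vh[:rank].T

basis = fit_key_basis(torch.randn(1024, 64), rank=16)   # fitted offline

k_new = torch.randn(1, 64)                # a freshly computed key
k_cached = k_new @ basis                  # store only [1, 16]: 4x smaller
k_approx = k_cached @ basis.T             # approximate key at attention time
print(k_approx.shape)                     # torch.Size([1, 64])
```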
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2022.09 | In-context Learning and Induction Heads | [pdf] | | ⭐️⭐️ | |
| 2024.01 | 🔥Transformers are Multi-State RNNs | [pdf] | [TOVA] | ⭐️⭐️ | |
| 2024.04 | 🔥[Retrieval Head] Retrieval Head Mechanistically Explains Long-Context Factuality | [pdf] | [Retrieval_Head] | ⭐️⭐️⭐️ | |
| 2024.04 | 🔥[Massive Activations] Massive Activations in Large Language Models | [pdf] | [Massive Activation] | ⭐️⭐️⭐️ | |
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
| 2024.02 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | [pdf] | | ⭐️ | |
| Date | Title | Paper | Code | Recom | Comment |
|:---|:---|:---|:---|:---|:---|
| 2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
| 2024.07 | 🔥🔥Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | [pdf] | [GeneralAI] | ⭐️⭐️⭐️ | |
©️License
GNU General Public License v3.0
🎉Contribute
Stars and PRs to this repo are welcome!
```bibtex
@misc{Awesome-LLM-KV-Cache@2024,
  title={Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with codes},
  url={https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  note={Open-source software available at https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  author={Zefan Cai and others},
  year={2024}
}
```