Awesome-KV-Cache-Compression

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
[![LICENSE](https://img.shields.io/github/license/October2001/Awesome-KV-Cache-Compression)](https://github.com/October2001/Awesome-KV-Cache-Compression/blob/main/LICENSE) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) [![commit](https://img.shields.io/github/last-commit/October2001/Awesome-KV-Cache-Compression?color=blue)](https://github.com/October2001/Awesome-KV-Cache-Compression/commits/main) [![PR](https://img.shields.io/badge/PRs-Welcome-red)](https://github.com/October2001/Awesome-KV-Cache-Compression/pulls) [![GitHub Repo stars](https://img.shields.io/github/stars/October2001/Awesome-KV-Cache-Compression)](https://github.com/October2001/Awesome-KV-Cache-Compression)

📢 News

🎉 [2024-07-23] Project Beginning 🥳

📜 Notice

This repository is constantly being updated 🤗 ...

Click on a paper's title to jump directly to the corresponding PDF link.

📷 Survey

  1. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption. Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao. COLM 2024.

🔍 Method

1️⃣ Pruning / Evicting / Sparse

  1. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava. NeurIPS 2023.

  2. SnapKV: LLM Knows What You are Looking for Before Generation. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen. Arxiv 2024. GitHub Repo stars

  3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen. NeurIPS 2023. GitHub Repo stars

  4. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. ICLR 2024.

  5. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao. ACL 2024. GitHub Repo stars

  6. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao. Arxiv 2024. GitHub Repo stars

  7. Transformers are Multi-State RNNs. Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz. Arxiv 2024. GitHub Repo stars

  8. Efficient Streaming Language Models with Attention Sinks. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. ICLR 2024. GitHub Repo stars

  9. A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini. Arxiv 2024.

  10. Retrieval Head Mechanistically Explains Long-Context Factuality. Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu. Arxiv 2024. GitHub Repo stars

  11. Efficient Sparse Attention needs Adaptive Token Release. Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li. ACL 2024. GitHub Repo stars

  12. Loki: Low-Rank Keys for Efficient Sparse Attention. Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele. Arxiv 2024.

  13. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen. Arxiv 2024. GitHub Repo stars

  14. ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching. Youpeng Zhao, Di Wu, Jun Wang. ISCA 2024.

  15. Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference. Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath. Arxiv 2024. GitHub Repo stars

  16. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou. Arxiv 2024. GitHub Repo stars

  17. Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters. Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe. Arxiv 2024.

  18. On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference. Siyu Ren, Kenny Q. Zhu. Arxiv 2024. GitHub Repo stars

  19. CORM: Cache Optimization with Recent Message for Large Language Model Inference. Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi. Arxiv 2024.

  20. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang. Arxiv 2024.

  21. ThinK: Thinner Key Cache by Query-Driven Pruning. Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo. Arxiv 2024.

  22. A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder. Hyun Rae Jo, Dong Kun Shin. Arxiv 2024. GitHub Repo stars

  23. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han. ICML 2024. GitHub Repo stars

  24. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference. Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi. Arxiv 2024.

  25. NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time. Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu. ACL 2024. GitHub Repo stars

  26. Post-Training Sparse Attention with Double Sparsity. Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng. Arxiv 2024. GitHub Repo stars

  27. Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope. Xiaoran Liu, Qipeng Guo, Yuerong Song, Zhigeng Liu, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu. Arxiv 2024.

  28. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference. Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti. ICML 2024.

  29. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. NeurIPS 2024. GitHub Repo stars

  30. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann. NeurIPS 2023.

  31. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu. Arxiv 2024.

  32. Sirius: Contextual Sparsity with Correction for Efficient LLMs. Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen. Arxiv 2024. GitHub Repo stars

  33. Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU. Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo. Arxiv 2024. GitHub Repo stars

  34. Training-Free Activation Sparsity in Large Language Models. James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun. Arxiv 2024. GitHub Repo stars

  35. KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models. Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma. Arxiv 2024.

  36. CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs. Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie. Arxiv 2024.

  37. Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty. Arxiv 2024.

  38. KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head. Isaac Rehg. Arxiv 2024. GitHub Repo stars

  39. InfiniPot: Infinite Context Processing on Memory-Constrained LLMs. Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang. EMNLP 2024.

  40. Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads. Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu. Arxiv 2024. GitHub Repo stars

  41. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang. Arxiv 2024. GitHub Repo stars
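
The pruning/eviction papers above differ in how they score token importance, but most share one skeleton: score each cached position (typically by the attention mass it has received), always keep a small window of recent tokens, retain the top-scoring positions up to a fixed budget, and drop the rest. The PyTorch snippet below is a minimal sketch of that generic pattern, not the algorithm of any single paper; the accumulated-attention score, `budget`, and `recent_window` are illustrative assumptions.

```python
import torch

def evict_kv_cache(keys, values, attn_weights, budget=512, recent_window=32):
    """Minimal sketch of score-based KV cache eviction.

    keys, values:  [batch, heads, seq_len, head_dim] cached tensors
    attn_weights:  [batch, heads, q_len, seq_len] attention probabilities
                   observed so far (a real system keeps a running accumulator
                   rather than the full matrix)
    """
    seq_len = keys.shape[2]
    if seq_len <= budget:
        return keys, values

    # Importance of each cached position: attention mass it has received,
    # summed over heads and query positions (an H2O/SnapKV-style heuristic).
    scores = attn_weights.sum(dim=(1, 2))                     # [batch, seq_len]

    # The most recent tokens are always kept, regardless of score.
    scores[:, -recent_window:] = float("inf")

    # Keep the top-`budget` positions; sort indices to preserve token order.
    keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    keep = keep[:, None, :, None].expand(-1, keys.shape[1], -1, keys.shape[3])
    return keys.gather(2, keep), values.gather(2, keep)
```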

2️⃣ Merging

  1. D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models. Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang. Arxiv 2024.

  2. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang. Arxiv 2024.

  3. CaM: Cache Merging for Memory-efficient LLMs Inference. Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji. ICML 2024. GitHub Repo stars

  4. Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs. Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin. ICLR 2024. GitHub Repo stars

  5. Token Merging: Your ViT But Faster. Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman. ICLR 2023. GitHub Repo stars

  6. LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan. EMNLP 2024. GitHub Repo stars

  7. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal. Arxiv 2024.

  8. Compressed Context Memory for Online Language Model Interaction. Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song. ICLR 2024. GitHub Repo stars
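
Rather than discarding evicted entries outright, the merging papers above fold them into the entries that are kept. Below is a heavily simplified sketch of that idea, in which each evicted key/value is averaged into its most similar retained neighbor; the cosine-similarity matching and uniform weighting are illustrative assumptions, and the listed papers use more careful similarity measures, weights, and merge targets.

```python
import torch
import torch.nn.functional as F

def merge_evicted_kv(keys, values, keep_idx, evict_idx):
    """Sketch: fold evicted KV entries into their most similar retained entry.

    keys, values: [heads, seq_len, head_dim] cached tensors
    keep_idx, evict_idx: 1-D LongTensors partitioning the cached positions.
    Returns merged keys/values of shape [heads, len(keep_idx), head_dim].
    """
    k_keep, v_keep = keys[:, keep_idx], values[:, keep_idx]
    k_evict, v_evict = keys[:, evict_idx], values[:, evict_idx]

    # Cosine similarity between every evicted key and every retained key.
    sim = torch.einsum("hed,hkd->hek",
                       F.normalize(k_evict, dim=-1),
                       F.normalize(k_keep, dim=-1))
    nearest = sim.argmax(dim=-1)                              # [heads, n_evict]

    # Uniform averaging: accumulate each evicted state onto its nearest
    # retained slot, then divide by how many states landed in each slot.
    k_merged, v_merged = k_keep.clone(), v_keep.clone()
    counts = torch.ones(k_keep.shape[:2] + (1,), dtype=keys.dtype, device=keys.device)
    idx = nearest.unsqueeze(-1).expand_as(k_evict)
    k_merged.scatter_add_(1, idx, k_evict)
    v_merged.scatter_add_(1, idx, v_evict)
    counts.scatter_add_(1, nearest.unsqueeze(-1),
                        torch.ones_like(nearest, dtype=keys.dtype).unsqueeze(-1))
    return k_merged / counts, v_merged / counts
```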

3️⃣ Cross-Layer

  1. You Only Cache Once: Decoder-Decoder Architectures for Language Models. Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei. NeurIPS 2024. GitHub Repo stars

  2. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley. Arxiv 2024.

  3. Layer-Condensed KV Cache for Efficient Inference of Large Language Models. Haoyi Wu, Kewei Tu. ACL 2024. GitHub Repo stars

  4. MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang. Arxiv 2024. GitHub Repo stars

  5. MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding. Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji. Arxiv 2024. GitHub Repo stars
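
Cross-layer methods shrink the cache along the depth dimension: several transformer layers read from one shared K/V cache instead of each layer keeping its own. The module below is a rough sketch of that sharing pattern under a simple producer/consumer grouping; the class name, grouping policy, and projections are illustrative assumptions, not any specific architecture from the list.

```python
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    """Sketch of cross-layer KV sharing: only 'producer' layers project K/V;
    consumer layers reuse the producer's cached K/V and compute only their
    own queries."""

    def __init__(self, dim, n_heads, is_producer):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.o_proj = nn.Linear(dim, dim)
        self.is_producer = is_producer
        if is_producer:
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        if self.is_producer:
            k = self.k_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
            shared_kv = (k, v)              # cached once, reused by consumer layers
        k, v = shared_kv
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.o_proj(out), shared_kv
```

With a depth-wise grouping where, say, every second layer is a producer, the number of cached K/V tensors is halved; the consumer layers trade some expressivity for that memory saving.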

4️⃣ Low-Rank

  1. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. EMNLP 2023.

  2. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. DeepSeek-AI. Arxiv 2024. GitHub Repo stars

  3. Effectively Compress KV Heads for LLM. Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu. Arxiv 2024.

  4. Palu: Compressing KV-Cache with Low-Rank Projection. Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu. Arxiv 2024. GitHub Repo stars
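
Low-rank approaches store a compressed factorization (or a shared latent) of the keys and values and reconstruct full-size tensors only when attention needs them. The sketch below uses a plain SVD-based rank-r factorization purely to illustrate the storage/reconstruction trade-off; MLA, GQA-style head sharing, and Palu all differ substantially from this in practice.

```python
import torch

def compress_kv_lowrank(k_cache, rank=64):
    """Factor a cached key tensor [seq_len, dim] into two rank-`rank` factors.
    Storage drops from seq_len * dim to (seq_len + dim) * rank values."""
    u, s, vh = torch.linalg.svd(k_cache, full_matrices=False)
    left = u[:, :rank] * s[:rank]          # [seq_len, rank]
    right = vh[:rank]                      # [rank, dim]
    return left, right

def reconstruct_kv(left, right):
    """Approximate the original keys only when attention is computed."""
    return left @ right                    # [seq_len, dim]

# Example: a 4096-token cache with dim 1024 at rank 64 stores
# (4096 + 1024) * 64 ≈ 0.33M values instead of 4096 * 1024 ≈ 4.2M.
```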

5️⃣ Quantization

  1. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang. Arxiv 2024.

  2. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization. June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee. Arxiv 2024.

  3. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu. ICML 2024. GitHub Repo stars

  4. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao. Arxiv 2024. GitHub Repo stars

  5. PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. Arxiv 2024.

  6. Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression. Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen. Arxiv 2024.

  7. SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models. Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin. Arxiv 2024. GitHub Repo stars

  8. QAQ: Quality Adaptive Quantization for LLM KV Cache. Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang. Arxiv 2024. GitHub Repo stars

  9. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. NeurIPS 2024. GitHub Repo stars

  10. WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More. Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie. Arxiv 2024.
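
The quantization papers above keep every token but shrink each cached element to a few bits, usually with per-group (per-token or per-channel) scales and zero-points, and often with a full-precision buffer for recent or salient tokens. Below is a minimal asymmetric round-to-nearest sketch; the 4-bit width and grouping axis are illustrative choices, not any specific paper's recipe.

```python
import torch

def quantize_kv(x, n_bits=4, group_dim=-1):
    """Asymmetric uniform quantization of a KV tensor along `group_dim`.
    Returns integer codes plus per-group scale/zero-point for dequantization."""
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=group_dim, keepdim=True)
    x_max = x.amax(dim=group_dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / qmax
    zero = (-x_min / scale).round()
    q = (x / scale + zero).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize_kv(q, scale, zero):
    return (q.float() - zero) * scale

# Rough memory math: an fp16 KV cache of 2 * layers * heads * head_dim * seq_len
# elements shrinks ~4x at 4 bits (ignoring the small scale/zero-point overhead).
```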

6️⃣ Prompt Compression

  1. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu. EMNLP 2023. GitHub Repo stars

  2. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang. ACL 2024. GitHub Repo stars

  3. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. ACL 2024. GitHub Repo stars

  4. TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning. Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor Rühle. Arxiv 2024.
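
Prompt compression reduces the KV cache indirectly: a small scorer model rates how informative each prompt token is, low-information tokens are dropped, and only the shortened prompt reaches the target LLM, so less cache is ever produced. The sketch below keeps the highest-surprisal tokens under a small causal LM; it is only a rough approximation of the LLMLingua-style pipeline, and the `gpt2` scorer and `keep_ratio` are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compress_prompt(prompt, keep_ratio=0.5, scorer_name="gpt2"):
    """Drop the lowest-surprisal tokens of a prompt under a small LM."""
    tok = AutoTokenizer.from_pretrained(scorer_name)
    lm = AutoModelForCausalLM.from_pretrained(scorer_name).eval()

    ids = tok(prompt, return_tensors="pt").input_ids           # [1, seq_len]
    with torch.no_grad():
        logits = lm(ids).logits                                # [1, seq_len, vocab]

    # Surprisal of token t under the LM given the preceding tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    surprisal = -logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)

    n_keep = max(1, int(surprisal.shape[1] * keep_ratio))
    keep = surprisal.topk(n_keep, dim=-1).indices.sort(dim=-1).values + 1
    keep = torch.cat([torch.zeros_like(keep[:, :1]), keep], dim=-1)  # keep 1st token

    return tok.decode(ids[0, keep[0]])
```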

📊 Evaluation

  1. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu. EMNLP 2024. GitHub Repo stars