MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

AnswerDotAI / cold-compress

Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of GPT-Fast, a simple, PyTorch-native generation codebase.

https://www.answer.ai/posts/2024-08-01-cold-compress.html

BSD 3-Clause "New" or "Revised" License

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention #29

Open griff4692 opened 3 months ago

griff4692 commented 3 months ago

Implement the method described in this paper (MInference 1.0).

The implementation would be similar to the KVCacheFastGen class in that it involves a profiling step.
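
For concreteness, a profiling step of this kind could look roughly like the sketch below. It is not taken from the paper's code or from KVCacheFastGen: the function names, the two candidate patterns, and the recall-based scoring are all illustrative assumptions. The idea is to collect per-head attention on a few calibration prompts and assign each head whichever sparse pattern retains the most attention mass. The sketch does not try to match the compute budgets of the candidates, which a real search would need to do.

```python
import math
import random

def causal_attention(q, k):
    """Causal softmax attention weights for one head, as a list of rows."""
    n, d = len(q), len(q[0])
    rows = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Pad future positions with zeros so every row has length n.
        rows.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return rows

def a_shape_mask(n, n_sink, n_local):
    """Keep the first n_sink key columns plus a local window of n_local keys."""
    return [[j <= i and (j < n_sink or i - j < n_local) for j in range(n)]
            for i in range(n)]

def strided_mask(n, stride):
    """Hypothetical second candidate: the diagonal plus every stride-th column."""
    return [[j <= i and (j == i or j % stride == 0) for j in range(n)]
            for i in range(n)]

def recall(attn, mask):
    """Fraction of the total attention mass that the mask retains."""
    kept = sum(a for row_a, row_m in zip(attn, mask)
               for a, keep in zip(row_a, row_m) if keep)
    total = sum(sum(row) for row in attn)
    return kept / total

def profile_heads(attn_per_head, candidates):
    """For each head, return the name of the candidate with the highest recall."""
    return [max(candidates, key=lambda name: recall(attn, candidates[name]))
            for attn in attn_per_head]

if __name__ == "__main__":
    # Random Q/K stand in for activations from real calibration prompts.
    random.seed(0)
    n, d, n_heads = 16, 8, 2
    heads = []
    for _ in range(n_heads):
        q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        heads.append(causal_attention(q, k))
    candidates = {"a_shape": a_shape_mask(n, 2, 4), "strided": strided_mask(n, 4)}
    print(profile_heads(heads, candidates))
```

The per-head assignments would be computed once offline and cached, then looked up at inference time to choose the mask (or kernel) for each head.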