NVIDIA / kvpress

LLM KV cache compression made easy
Apache License 2.0

add support for QuantizedCache #5

Closed: SimJeg closed this 4 days ago

SimJeg commented 4 days ago

Transformers supports KV cache quantization through the QuantizedCache class (see their blog post). I propose updating BasePress and pipeline.py to support it.

Note that this requires installing several extra packages, which I did not add to pyproject.toml, following their philosophy of not installing additional kernels. I also ran into issues during installation, as mentioned here.
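
For illustration, here is a minimal sketch of what the user side could look like when the quantization backend is treated as an optional dependency. It is not the kvpress implementation. The helper names `has_module` and `quantized_cache_kwargs` are hypothetical, and it assumes the quanto backend is importable as either `optimum.quanto` or `quanto`, which may depend on the installed version. The `cache_implementation` and `cache_config` keys follow the usage shown in the Transformers blog post.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be found, without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `optimum`) is not installed.
        return False


def quantized_cache_kwargs(
    nbits: int = 4, backend: str = "quanto", check_install: bool = True
) -> dict:
    """Build extra generate() kwargs that request a quantized KV cache.

    Hypothetical helper: the module names checked below are assumptions.
    """
    if check_install and backend == "quanto":
        if not (has_module("optimum.quanto") or has_module("quanto")):
            raise ImportError(
                "The quanto backend is not installed; it is an optional "
                "dependency and is not listed in pyproject.toml."
            )
    return {
        "cache_implementation": "quantized",
        "cache_config": {"backend": backend, "nbits": nbits},
    }


# Intended usage (not executed here, needs a loaded model and tokenized inputs):
# outputs = model.generate(**inputs, **quantized_cache_kwargs(nbits=4))
```

Failing early with a clear ImportError keeps the extra kernels out of the default install while still telling the user what is missing.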