Transformers supports KV cache quantization through the `QuantizedCache` class (see their blog post). I propose to update `BasePress` and `pipeline.py` to support it.
Note that this implies adding several dependencies, which I did not include in `pyproject.toml`, following their philosophy of not installing additional kernels. I noticed issues during installation, as mentioned here.
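For reference, a minimal sketch of how transformers exposes quantized KV caches through `generate()` (assuming the optional `quanto` backend is installed; the model name is only illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any causal LM supported by transformers would do.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The KV cache can be quantized to", return_tensors="pt").to(model.device)

# Enable the quantized KV cache via generate(); backend and nbits are the
# main knobs exposed by the cache config (requires the chosen backend,
# e.g. quanto, to be installed separately).
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Supporting this in `BasePress` and `pipeline.py` would mostly mean accepting such a cache (or cache config) instead of assuming the default `DynamicCache`.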