-
It would be good if the KV cache key type could be set in Ollama.
llama.cpp allows you to set the key cache type, which can improve memory usage as the KV store increases in size, especially when ru…
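For comparison, this is roughly what the llama.cpp option looks like when driven from the llama-cpp-python bindings; a minimal sketch, assuming the `type_k`/`type_v` and `flash_attn` constructor parameters (the model path and the numeric q8_0 type value are illustrative, not verified against a specific release):

```python
# Minimal sketch with llama-cpp-python: quantize the K/V cache to q8_0.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # hypothetical local GGUF file
    n_ctx=8192,
    type_k=8,                   # 8 == q8_0 in ggml's type enum (assumed value)
    type_v=8,                   # quantizing the V cache also requires flash attention
    flash_attn=True,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```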
-
**Describe the bug**
Following the [readme](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md) here, I cannot get an FP8 weight, activation, and KV cache quant…
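For reference, the kind of recipe I'm working from looks roughly like the sketch below (adapted from that README; the exact recipe fields, calibration dataset, and `oneshot` import path vary between llm-compressor versions, so treat this as an approximation rather than a known-good config):

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot  # newer releases: `from llmcompressor import oneshot`

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# FP8 weights + activations for all Linear layers, plus an FP8 KV cache scheme.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    targets: ["Linear"]
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
oneshot(
    model=model,
    dataset="open_platypus",          # small calibration set; any chat-style dataset works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=MODEL_ID.split("/")[-1] + "-FP8-KV",
)
```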
-
### Your current environment
The output of `python collect_env.py`
I attempted to run it, but it threw errors. I'm running this in Docker on Windows 11.
### Model Input Dumps
_No response_…
-
### Feature request
Enable quantized KV cache for the Mistral model, as described in #30483.
### Motivation
KV cache quantization has emerged as a crucial optimization, particularly in high-throughput, …
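For context, the generate-time API this feature request targets looks roughly like the sketch below (quantized cache support depends on the Transformers version and requires a quantization backend such as quanto to be installed; the model ID is just an example):

```python
# Sketch: generation with a quantized KV cache in Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Explain KV cache quantization in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",                # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit keys/values via quanto
)
print(tok.decode(out[0], skip_special_tokens=True))
```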
-
Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate 4-bit KV cache quantization? If so, where is the corresponding CUDA kerne…
-
### 🐛 Describe the bug
When I compile the .pte file from llama-7b-chat as indicated by "https://pytorch.org/executorch/stable/build-run-vulkan.html", I find that the generated .pte file size is too big…
-
Hi, thanks for the lib! When checking https://github.com/vllm-project/llm-compressor/issues/935, it seems that `one_shot` auto-saves everything to the output folder. That looks great, but if I understa…
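For what it's worth, the two saving patterns I've seen in other examples are sketched below; the kwargs (`output_dir`, `save_compressed`) are assumptions based on those examples rather than documented guarantees:

```python
# Sketch: two ways saving is typically handled (check against your llm-compressor version).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Option A: let oneshot write the quantized model into a folder.
oneshot(model=model, recipe=recipe, output_dir="./my-quantized-model")

# Option B: run oneshot in place, then save explicitly afterwards.
oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-quantized-model", save_compressed=True)
tokenizer.save_pretrained("./my-quantized-model")
```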
-
1. Is the kv-cache actually **not used** in any of the LLM-evaluation tasks, since those tasks usually take only a **one-step** attention calculation, unlike the language-generation process, which needs a lot of…
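To make the distinction concrete, here is a rough sketch of the two access patterns (a generic Transformers example; the model name is just a placeholder, not the one used in this repo):

```python
# Sketch: single-pass evaluation vs. iterative generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids

# Evaluation-style scoring (perplexity, multiple-choice): one forward pass over the
# whole sequence, logits for every position at once, no reuse of a cache.
with torch.no_grad():
    eval_logits = model(ids, use_cache=False).logits

# Generation: the cache lets each new token attend to previously computed keys/values
# without recomputing them, so this is where KV-cache size (and quantization) matters.
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, use_cache=True)
print(tok.decode(out[0]))
```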
-
**Description**
[Flash attention](https://github.com/ggerganov/llama.cpp/pull/5021) and [quantized KV stores](https://github.com/ggerganov/llama.cpp/discussions/59320) are both supported by llama.cpp…
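As a rough illustration of why exposing both matters, the back-of-the-envelope arithmetic below compares KV-cache sizes at different element widths; the layer/head shape is a generic Llama-7B-like configuration, and the q8_0/q4_0 bytes-per-element figures are approximations that include the quantization scales:

```python
# Approximate KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/elem.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=8192)  # Llama-7B-like shape
for name, bpe in [("f16", 2.0), ("q8_0 (approx.)", 1.0625), ("q4_0 (approx.)", 0.5625)]:
    gib = kv_cache_bytes(**cfg, bytes_per_elem=bpe) / 2**30
    print(f"{name:>15}: {gib:.2f} GiB")
```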
-
### Description of the bug:
I tried running the example.py script given for the quantization example, but for Llama. Wherever a reference to Gemma was made, I made the appropriate reference to Llama. The…