-
It would be good if the KV Key cache type could be set in Ollama.
llama.cpp allows you to set the K cache type, which can improve memory usage as the KV store increases in size, especially when ru…
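For comparison, this is roughly how the llama.cpp Python bindings (llama-cpp-python) expose it today; the `type_k`/`type_v` parameters and the GGML type constants below are my recollection of that API, not anything Ollama currently provides, so treat this as a sketch:

```python
# Sketch: quantized K/V cache via llama-cpp-python (assumed API, recent versions).
# type_k / type_v select the GGML tensor type used for the KV cache;
# q8_0 roughly halves KV memory versus the default f16.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model.gguf",           # placeholder path
    n_ctx=8192,
    flash_attn=True,                   # llama.cpp requires flash attention to quantize the V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,   # K cache type (default: f16)
    type_v=llama_cpp.GGML_TYPE_Q8_0,   # V cache type (default: f16)
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```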
-
**Describe the bug**
Following the [readme](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md) here, I cannot get an FP8 weight, activation, and KV cache quant…
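For context, a rough sketch of what I'm attempting, loosely following the linked README; the recipe keys, the `oneshot` arguments, and the calibration dataset name are from memory and may not match the current llm-compressor API exactly:

```python
# Rough sketch of FP8 weight + activation + KV-cache quantization with llm-compressor.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:           {num_bits: 8, type: float, strategy: tensor, dynamic: false, symmetric: true}
          input_activations: {num_bits: 8, type: float, strategy: tensor, dynamic: false, symmetric: true}
      kv_cache_scheme:       {num_bits: 8, type: float, strategy: tensor, dynamic: false, symmetric: true}
"""

oneshot(
    model=model,
    dataset="ultrachat_200k",          # assumed registered calibration dataset name
    recipe=recipe,
    output_dir="Llama-3-8B-FP8-KV",    # quantized checkpoint is written here
    max_seq_length=2048,
    num_calibration_samples=512,
)
```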
-
### Your current environment
The output of `python collect_env.py`
I attempted to run that, but it threw errors. I'm running this in Docker on Windows 11.
### Model Input Dumps
_No response_…
-
Hi, thanks for the lib! When checking https://github.com/vllm-project/llm-compressor/issues/935, it seems that `one_shot` auto-saves everything to the output folder. That looks great, but if I understa…
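To make the question concrete, here's roughly the contrast I mean; the argument names (`output_dir`, `save_compressed`) are my assumption of the current API based on the repo's examples, not a confirmed behavior:

```python
# Sketch (assumed API): automatic save via output_dir vs. an explicit save afterwards.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # small placeholder model
SAVE_DIR = "tinyllama-fp8"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

AUTO_SAVE = True  # toggle between the two behaviors being discussed

if AUTO_SAVE:
    # Variant A: oneshot writes the compressed checkpoint itself.
    oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
else:
    # Variant B: what I expected — quantize in memory, then decide where/whether to save.
    oneshot(model=model, recipe=recipe)
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
```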
-
Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate KV cache 4-bit quantization? If so, where is the corresponding CUDA kerne…
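Not an answer, but to make the question concrete: this is the kind of per-group asymmetric 4-bit quantize/dequantize of K/V tensors I mean, written in plain PyTorch as an illustration of the numerics a fused CUDA/PagedAttention kernel would implement (not the repo's actual kernel):

```python
# Illustration only: asymmetric 4-bit per-group quantization of a KV tensor.
import torch

def quantize_kv_int4(x: torch.Tensor, group_size: int = 64):
    """Quantize x to 4-bit integers with a per-group scale and zero point."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)                       # [num_groups, group_size]
    x_min = g.min(dim=-1, keepdim=True).values
    x_max = g.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / 15.0      # 4 bits -> 16 levels (0..15)
    zero = x_min
    q = ((g - zero) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero, orig_shape

def dequantize_kv_int4(q, scale, zero, orig_shape):
    return (q.float() * scale + zero).reshape(orig_shape)

# Example: one attention head's K cache, [seq_len, head_dim]
k = torch.randn(128, 128)
q, s, z, shape = quantize_kv_int4(k)
k_hat = dequantize_kv_int4(q, s, z, shape)
print("max abs error:", (k - k_hat).abs().max().item())
```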
-
**Description**
[Flash attention](https://github.com/ggerganov/llama.cpp/pull/5021) and [quantized kv stores](https://github.com/ggerganov/llama.cpp/discussions/59320) are both supported by llama.cpp…
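To sketch what this could look like from the user's side, assuming it were exposed alongside the existing `OLLAMA_FLASH_ATTENTION` toggle (the `OLLAMA_KV_CACHE_TYPE` name below is hypothetical, not a flag Ollama documents today):

```python
# Hypothetical sketch: launching Ollama with flash attention enabled and a
# quantized KV cache type, if it were exposed via an environment variable.
import os
import subprocess

env = dict(os.environ)
env["OLLAMA_FLASH_ATTENTION"] = "1"    # existing toggle
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"   # assumed name for the requested setting

subprocess.run(["ollama", "serve"], env=env)
```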
-
### Your current environment
vllm 0.5.2
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to b…
```
-
### Your current environment
I want to deploy neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 with 8 x NVIDIA L20,
using --tensor-parallel-size=8 --enforce-eager --trust-remote-code --quantization=fp8 --kv…
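For reference, the roughly equivalent offline setup through vLLM's Python `LLM` API; this is only a sketch whose argument names mirror the CLI flags above, and `kv_cache_dtype="fp8"` is my assumption for the truncated `--kv…` flag:

```python
# Sketch of the deployment described above via vLLM's offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/DeepSeek-Coder-V2-Instruct-FP8",
    tensor_parallel_size=8,        # 8 x NVIDIA L20
    enforce_eager=True,
    trust_remote_code=True,
    quantization="fp8",
    kv_cache_dtype="fp8",          # quantize the KV cache to FP8 as well (assumed intent)
)
out = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```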
-
1. Is the KV cache actually **not used** in all the LLM-evaluation tasks, since those tasks usually take **only a one-step** attention calculation, unlike the language-generation process, which needs a lot of…
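To make question 1 concrete, this is the distinction I mean between single-pass scoring and step-by-step generation (a hedged sketch using Hugging Face transformers with a small placeholder model):

```python
# Sketch: why single-pass evaluation doesn't reuse a KV cache,
# while multi-step generation does.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("The KV cache stores past keys and values.", return_tensors="pt").input_ids

with torch.no_grad():
    # Evaluation-style scoring: one forward pass over the whole sequence.
    # All attention is computed in this single call; nothing is reused later.
    logits = model(ids).logits

    # Generation: each new token attends to cached K/V from previous steps,
    # so the cache is built up and reused step by step.
    out = model.generate(ids, max_new_tokens=20, use_cache=True)

print(tok.decode(out[0]))
```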
-
Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimiz…