kv-cache-quantization Search Results

ollama/ollama #5091

KV Cache Quantization

It would be good if the KV Key cache type could be set in Ollama. llama.cpp allows you to set the Key cache type which can improve memory usage as the KV store increases in size, especially when ru…

sammcj updated 1 week ago

vllm-project/llm-compressor #660

KV Cache Quantization example cause problem

**Describe the bug** Following [readme](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md) here I cannot get a fp8 weight activation and kv cache quant…

weicheng59 updated 1 week ago

vllm-project/vllm #10411

[Bug]: KV Cache Quantization with GGUF turns out quite poorl…

### Your current environment The output of `python collect_env.py` I attempted to run that, but it threw errors. I'm running this in docker on Windows 11 ### Model Input Dumps _No response_…

phazei updated 2 weeks ago

vllm-project/llm-compressor #936

Does `one_shot` save model twice?

Hi thanks for the lib! When checking https://github.com/vllm-project/llm-compressor/issues/935, it seems that `one_shot` auto saves everything to the output folder. That looks great, but if I understa…

fzyzcjy updated 5 days ago

efeslab/Atom #23

Question about KV Cache quantization

Hi, thanks for your great work! I have a small question about KV Cache quantization. Did you use pagedattention to accelerate KV Cache 4-bit quantization? If so, where is the corresponding cuda kerne…

SherrySwift updated 2 months ago

a-ghorbani/pocketpal-ai #79

[Feat]: quantized KV cache and flash attention

**Description** [Flash attention](https://github.com/ggerganov/llama.cpp/pull/5021) and [quantized kv stores](https://github.com/ggerganov/llama.cpp/discussions/59320) are both supported by llama.cpp…

mseri updated 1 day ago

vllm-project/vllm #10283

[Bug]: LLM initialization time increases significantly with …

### Your current environment vllm 0.5.2 The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.3.1+cu121 Is debug build: False CUDA used to b…

piood updated 2 weeks ago

vllm-project/vllm #10151

[Usage]: Error executing method determine_num_available_bloc…

### Your current environment I want to deply neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 with 8 x NVIDIA L20， use -tensor-parallel-size=8 --enforce-eager --trust-remote-code --quantization=fp8 --kv…

SamuelScc updated 1 week ago

mit-han-lab/smoothquant #69

general question about SmoothQuant kv-cache quantization

1. Is kv-cache actually **not used** in all the LLM-evaluation tasks, since those tasks usually takes **only one-step** attention calculation, not like language generating process which needs a lot of…

brisker updated 1 month ago

vllm-project/vllm #10294

[Feature]: Quark quantization format upstream to VLLM

Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimiz…

kewang-xlnx updated 2 days ago

1000+ results for kv-cache-quantization

1000+ results
for kv-cache-quantization