-
It would be good if the KV cache key type could be set in Ollama.
llama.cpp allows you to set the key cache type, which can improve memory usage as the KV store increases in size, especially when ru…
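For comparison, this is roughly what the llama.cpp option looks like when driven from the llama-cpp-python bindings; a minimal sketch, assuming the `type_k`/`type_v` and `flash_attn` constructor parameters (the model path and the numeric q8_0 type value are illustrative, not verified against a specific release):

```python
# Minimal sketch with llama-cpp-python: quantize the K/V cache to q8_0.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # hypothetical local GGUF file
    n_ctx=8192,
    type_k=8,                   # 8 == q8_0 in ggml's type enum (assumed value)
    type_v=8,                   # quantizing the V cache also requires flash attention
    flash_attn=True,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```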
-
**Describe the bug**
Following the [readme](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md) here, I cannot get an FP8 weight, activation, and KV cache quant…
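For reference, the kind of recipe I'm working from looks roughly like the sketch below (adapted from that README; the exact recipe fields, calibration dataset, and `oneshot` import path vary between llm-compressor versions, so treat this as an approximation rather than a known-good config):

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot  # newer releases: `from llmcompressor import oneshot`

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# FP8 weights + activations for all Linear layers, plus an FP8 KV cache scheme.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    targets: ["Linear"]
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
oneshot(
    model=model,
    dataset="open_platypus",          # small calibration set; any chat-style dataset works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=MODEL_ID.split("/")[-1] + "-FP8-KV",
)
```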
-
### Your current environment
The output of `python collect_env.py`
I attempted to run it, but it threw errors. I'm running this in Docker on Windows 11.
### Model Input Dumps
_No response_…
-
### Feature request
Enable quantized KV cache for the Mistral model, as described in #30483.
### Motivation
KV cache quantization has emerged as a crucial optimization, particularly in high-throughput, …
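For context, the generate-time API this feature request targets looks roughly like the sketch below (quantized cache support depends on the Transformers version and requires a quantization backend such as quanto to be installed; the model ID is just an example):

```python
# Sketch: generation with a quantized KV cache in Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Explain KV cache quantization in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",                # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit keys/values via quanto
)
print(tok.decode(out[0], skip_special_tokens=True))
```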
-
Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate 4-bit KV cache quantization? If so, where is the corresponding CUDA kerne…
-
### 🐛 Describe the bug
When I compile the .pte file from llama-7b-chat as indicated by "https://pytorch.org/executorch/stable/build-run-vulkan.html", I find that the generated .pte file size is too big…
-
Hi, thanks for the lib! When checking https://github.com/vllm-project/llm-compressor/issues/935, it seems that `one_shot` auto-saves everything to the output folder. That looks great, but if I understa…
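For what it's worth, the two saving patterns I've seen in other examples are sketched below; the kwargs (`output_dir`, `save_compressed`) are assumptions based on those examples rather than documented guarantees:

```python
# Sketch: two ways saving is typically handled (check against your llm-compressor version).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Option A: let oneshot write the quantized model into a folder.
oneshot(model=model, recipe=recipe, output_dir="./my-quantized-model")

# Option B: run oneshot in place, then save explicitly afterwards.
oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-quantized-model", save_compressed=True)
tokenizer.save_pretrained("./my-quantized-model")
```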
-
1. Is the kv-cache actually **not used** in any of the LLM-evaluation tasks, since those tasks usually take only a **one-step** attention calculation, unlike the language-generation process, which needs a lot of…
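To make the distinction concrete, here is a rough sketch of the two access patterns (a generic Transformers example; the model name is just a placeholder, not the one used in this repo):

```python
# Sketch: single-pass evaluation vs. iterative generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids

# Evaluation-style scoring (perplexity, multiple-choice): one forward pass over the
# whole sequence, logits for every position at once, no reuse of a cache.
with torch.no_grad():
    eval_logits = model(ids, use_cache=False).logits

# Generation: the cache lets each new token attend to previously computed keys/values
# without recomputing them, so this is where KV-cache size (and quantization) matters.
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, use_cache=True)
print(tok.decode(out[0]))
```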
-
**Description**
[Flash attention](https://github.com/ggerganov/llama.cpp/pull/5021) and [quantized KV stores](https://github.com/ggerganov/llama.cpp/discussions/59320) are both supported by llama.cpp…
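As a rough illustration of why exposing both matters, the back-of-the-envelope arithmetic below compares KV-cache sizes at different element widths; the layer/head shape is a generic Llama-7B-like configuration, and the q8_0/q4_0 bytes-per-element figures are approximations that include the quantization scales:

```python
# Approximate KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/elem.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=8192)  # Llama-7B-like shape
for name, bpe in [("f16", 2.0), ("q8_0 (approx.)", 1.0625), ("q4_0 (approx.)", 0.5625)]:
    gib = kv_cache_bytes(**cfg, bytes_per_elem=bpe) / 2**30
    print(f"{name:>15}: {gib:.2f} GiB")
```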
-
### Description of the bug:
I tried running the example.py script given for the quantization example, but for Llama. Wherever a reference to Gemma was made, I made the appropriate reference to Llama. The…