-
### What happened?
llama.cpp produces garbled output on the 310P3 when using Qwen2.5-7b-f16.gguf
### Name and Version
./build/bin/llama-cli -m Qwen2.5-7b-f16.gguf -p "who are you" -ngl 32 -fa
### What operating system are you seeing the …
-
**Description**
HuggingFace's Quanto has implemented 4-bit and 2-bit KV cache quantization compatible with Transformers. See: https://huggingface.co/blog/kv-cache-quantization
I may PR when I've t…
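For reference, a minimal sketch of how the quantized KV cache from that blog post is driven through `generate()` in Transformers; the model id, prompt, and generation settings below are placeholders rather than part of the original request:

```python
# Minimal sketch, assuming transformers plus the quanto package are installed.
# Model id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# cache_implementation="quantized" swaps the default KV cache for a quantized one;
# cache_config picks the quanto backend and the bit width (nbits=4 or nbits=2).
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```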
-
# TensorRT Model Optimizer - Product Roadmap
[TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt)’s north star is to be the best-in-class model optimization toolki…
-
- [ ] FP8 KV-cache
- [ ] KV-cache prefix reuse
- [ ] Grammar-constrained decoding speedup
- [ ] `torch.compile`-like speedups
- [ ] Simple one-liner `pip install`
- [ ] Multi-LoRA support (LoRAX-style)
…
-
### What happened?
I would like to begin by expressing my sincere gratitude to the authors for their dedication and effort in developing this work.
To provide context for the issue I am encounter…
-
### System Info
- X86_64
- RAM: 30 GB
- GPU: A10G, VRAM: 23GB
- Lib: Tensorrt-LLM v0.9.0
- Container Used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- Model used: Mistral 7B
### …
-
Hi @lamikr,
I built rocm_sdk_builder on a freshly installed Ubuntu 24.04.1. It took 5 hours, 120 GB of storage, and many hours of fixing small issues while building the repo (reference: https://gith…
-
SUMMARY:
- [x] Avoid full pass through the model for quantization modifier
- [x] Data free `oneshot`
- [x] Runtime of GPTQ with large models – how to do a 70B model?
- [x] Runtime of GPTQ with act…
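The checklist above refers to a specific quantization-modifier/oneshot workflow that isn't shown here; purely as a generic illustration of why GPTQ runtime is a concern at 70B scale, here is a sketch using the Transformers/Optimum GPTQ integration instead (model id and calibration dataset are placeholder choices):

```python
# Generic one-shot GPTQ sketch via the Transformers/Optimum integration, not the
# library this checklist tracks. GPTQ needs a calibration forward pass over sample
# data, which is the runtime concern for 70B-scale models noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small placeholder; a 70B checkpoint makes this pass far slower
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # per-group quantization scales
    dataset="c4",    # calibration data; the forward pass over it dominates runtime
    tokenizer=tokenizer,
)

# Loading with quantization_config runs the one-shot GPTQ calibration on the fly.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("opt-125m-gptq-4bit")
```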
-
hi, I found that trt-llm KV cache quantization leads to serious model accuracy loss, while vllm and lmdeploy show only a small loss.
- model: qwen1.5-7b
- evalset: cmmlu
![Image](https://github.com/user-attachments/asset…
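For context, this is roughly how KV-cache quantization gets switched on in the two baselines mentioned; the exact flags and versions behind the comparison above aren't stated, so the model path and dtypes here are assumptions:

```python
# Hedged sketch of enabling KV-cache quantization in the two baselines named above.
# The exact settings behind the reported comparison are unknown; model path and
# dtypes are assumptions.

# vLLM: quantize the KV cache to FP8 via kv_cache_dtype.
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-7B-Chat", kv_cache_dtype="fp8")
out = llm.generate(["who are you"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)

# LMDeploy: quant_policy=8 enables int8 KV cache (quant_policy=4 selects int4).
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("Qwen/Qwen1.5-7B-Chat", backend_config=TurbomindEngineConfig(quant_policy=8))
print(pipe(["who are you"])[0].text)
```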
-
### Description
model https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/qwen2.5-coder-7b-instruct-q5_k_m.gguf
Generating on the GPU using the source code
Run the source code example LLama Ker…