-
### What happened?
Turning on flash attention degrades performance under ROCm (at least it does with a 7900 XTX). Using batched bench, the degradation is quite minor at a batch size of 1…
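For reference, a minimal A/B timing sketch (assuming the llama-cpp-python bindings with a ROCm/hipBLAS build are installed and expose the flash_attn toggle; the model path is a placeholder):
```python
# Rough A/B comparison of generation time with flash attention on and off.
# Assumes llama-cpp-python built for ROCm (hipBLAS); the model path is hypothetical.
import time
from llama_cpp import Llama

def time_generation(flash_attn: bool) -> float:
    llm = Llama(
        model_path="./models/model-q4_k_m.gguf",  # placeholder path
        n_gpu_layers=-1,          # offload all layers to the GPU
        n_ctx=4096,
        flash_attn=flash_attn,    # toggle flash attention
        verbose=False,
    )
    start = time.perf_counter()
    llm("Write a short paragraph about GPUs.", max_tokens=128)
    return time.perf_counter() - start

for fa in (False, True):
    print(f"flash_attn={fa}: {time_generation(fa):.2f}s")
```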
-
A new, interesting quantization scheme was published, which not only reduces memory consumption (like current quantization schemes) but also reduces computation.
> **[QuaRot: Outlier-Free 4-Bit In…
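To illustrate the core idea, here is a toy numpy sketch (not the paper's code; the quantize helper is a simplified symmetric per-tensor 4-bit stand-in): rotating weights and activations by an orthogonal Hadamard matrix leaves the matmul result unchanged while spreading activation outliers across dimensions, so low-bit quantization of the activations loses much less information.
```python
# Toy illustration of the rotation idea: (W Q)(Q^T x) = W x exactly, but the
# rotated activations Q^T x have no extreme outlier channels, so simple
# symmetric 4-bit quantization introduces far less error.
import numpy as np
from scipy.linalg import hadamard

d = 256
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
x = rng.normal(size=d)
x[::32] *= 50.0                      # inject a few large outlier channels

Q = hadamard(d) / np.sqrt(d)         # orthogonal Hadamard rotation

def quantize(v, bits=4):
    """Simplified symmetric per-tensor quantization to `bits` bits."""
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

# The exact output is unchanged by the rotation.
assert np.allclose(W @ x, (W @ Q) @ (Q.T @ x))

err_plain   = np.linalg.norm(W @ quantize(x) - W @ x)
err_rotated = np.linalg.norm((W @ Q) @ quantize(Q.T @ x) - W @ x)
print(f"4-bit error, no rotation:   {err_plain:.2f}")
print(f"4-bit error, with rotation: {err_rotated:.2f}")
```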
-
Starting vLLM with the FLASHINFER backend reported an error when enabling --quantization gptq and --kv-cache-dtype fp8_e5m2.
Start command:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 78…
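For reference, a minimal offline sketch that exercises the same combination outside the API server (the GPTQ checkpoint name is a placeholder, and selecting the backend via VLLM_ATTENTION_BACKEND is an assumption based on common vLLM usage, not taken from the original report):
```python
# Offline reproduction sketch: FlashInfer backend + GPTQ + fp8_e5m2 KV cache.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # select the FlashInfer backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",  # placeholder GPTQ model
    quantization="gptq",
    kv_cache_dtype="fp8_e5m2",
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```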
-
### Your current environment
```text
vllm-0.6.4.post1
```
### How would you like to use vllm
I am using the latest vLLM version, and I need to apply RoPE scaling to llama3.1-8b and gemma2-9b…
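A minimal sketch of passing a RoPE-scaling override through the offline LLM API (the scaling type, factor, and key names below are assumed examples, not recommended settings; depending on the vLLM version this may instead go through a separate rope_scaling engine argument / --rope-scaling CLI flag):
```python
# Sketch: override rope_scaling in the model config via hf_overrides.
# The dict below is an assumed example; the expected key ("rope_type" vs. the
# older "type") depends on the vLLM/transformers versions in use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    hf_overrides={"rope_scaling": {"rope_type": "dynamic", "factor": 2.0}},
    max_model_len=16384,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```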
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N…
-
### How would you like to use vllm
I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible 0.5.4 vLLM server on a HuggingFace Inference End…
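For the parallel-call side, a sketch of fanning out concurrent requests with the async OpenAI client against an OpenAI-compatible endpoint (base URL, API key, model name, and image URLs are placeholders for the reporter's setup):
```python
# Fan out concurrent vision requests to an OpenAI-compatible vLLM endpoint.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholders

async def describe(image_url: str) -> str:
    resp = await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    urls = [f"https://example.com/img_{i}.jpg" for i in range(8)]  # placeholder URLs
    results = await asyncio.gather(*(describe(u) for u in urls))
    print(results)

asyncio.run(main())
```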
-
Can anyone help me with these questions?
1) When I launch the OpenAI-compatible vLLM server `python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --max-model-len 327…
-
### Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capab…
-
### Name and Version
./llama-cli --version
version: 4077 (af148c93)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
### Operating systems
Mac
### W…
-
### System Info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|--------------------…