-
I had some free time and wanted to try out inference speed on a P100.
An error occurred while loading the model:
```
(…)kura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf: 100%
7.85G/7.85G [00:39
```
-
### Your current environment
The output of `python collect_env.py`
WARNING 11-22 07:19:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make s…
-
**Describe the bug**
ValueError: XFormers does not support attention logits soft capping.
**Full Error log**
{
"name": "ValueError",
"message": "XFormers does not support attention lo…
-
The code is as follows:
```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pr…
```
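The snippet above is cut off; below is a rough, runnable reconstruction of the same offline-inference flow. Everything after `AutoTokenizer.from_pretrained` (the engine keyword arguments, the chat prompt, and the sampling parameters) is assumed for illustration rather than taken from the original report:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"

# Load the tokenizer that ships with the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Assumed engine arguments; the original values beyond this point are not visible.
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

# Build a chat-formatted prompt and run a single generation.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a quick sort in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```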
-
Environment:
Hardware: RTX 4090
Driver Version: 550.107.02
Software: CUDA release 12.4, V12.4.131
absl-py 2.1.0
accelerate 0.31.0
aenum …
-
### Your current environment
Relevant package versions:
```text
vllm 0.5.5
vllm-flash-attn 2.6.1
```
downloa…
-
### What happened?
I am trying to run inference with the RPC example. When running llama-cli with the RPC feature against a single rpc-server on localhost, the inference throughput is only 1.9 tok/sec for lla…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
```
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…
-
### What is the issue?
The output is cut off in the middle of generation. Here's the log:
```
Aug 06 15:10:46 user-desktop systemd[4465]: Started Ollama Service.
Aug 06 15:10:46 user-desktop ollama[…
```