-
I had some free time and wanted to try out inference speed on a P100.
An error occurred while loading the model:
```
(…)kura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf: 100%
7.85G/7.85G [00:39
```
-
### Your current environment
The output of `python collect_env.py`
WARNING 11-22 07:19:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make s…
-
**Describe the bug**
ValueError: XFormers does not support attention logits soft capping.
**Full Error log**
{
"name": "ValueError",
"message": "XFormers does not support attention lo…
-
The code is as follows:
```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pr…
```
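The snippet above is cut off; below is a rough, runnable reconstruction of the same offline-inference flow. Everything after `AutoTokenizer.from_pretrained` (the engine keyword arguments, the chat prompt, and the sampling parameters) is assumed for illustration rather than taken from the original report:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"

# Load the tokenizer that ships with the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Assumed engine arguments; the original values beyond this point are not visible.
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

# Build a chat-formatted prompt and run a single generation.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a quick sort in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```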
-
Environment:
Hardware: RTX 4090
Driver Version: 550.107.02
Software: CUDA release 12.4, V12.4.131
absl-py 2.1.0
accelerate 0.31.0
aenum …
-
### Your current environment
Relevant package versions:
```text
vllm 0.5.5
vllm-flash-attn 2.6.1
```
downloa…
-
### What happened?
I am trying to run inference with the RPC example. When running llama-cli with the RPC feature against a single rpc-server on localhost, the inference throughput is only 1.9 tok/sec for lla…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
```
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…
-
### What is the issue?
The output is cut off in the middle of generation. Here's the log:
```
Aug 06 15:10:46 user-desktop systemd[4465]: Started Ollama Service.
Aug 06 15:10:46 user-desktop ollama[…
```