-
### What happened?
Turning on flash attention degrades performance under ROCm (at least it does with a 7900 XTX). Using batched bench, the degradation is quite minor at a batch size of 1…
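For reference, a minimal A/B timing sketch (assuming the llama-cpp-python bindings with a ROCm/hipBLAS build are installed and expose the flash_attn toggle; the model path is a placeholder):
```python
# Rough A/B comparison of generation time with flash attention on and off.
# Assumes llama-cpp-python built for ROCm (hipBLAS); the model path is hypothetical.
import time
from llama_cpp import Llama

def time_generation(flash_attn: bool) -> float:
    llm = Llama(
        model_path="./models/model-q4_k_m.gguf",  # placeholder path
        n_gpu_layers=-1,          # offload all layers to the GPU
        n_ctx=4096,
        flash_attn=flash_attn,    # toggle flash attention
        verbose=False,
    )
    start = time.perf_counter()
    llm("Write a short paragraph about GPUs.", max_tokens=128)
    return time.perf_counter() - start

for fa in (False, True):
    print(f"flash_attn={fa}: {time_generation(fa):.2f}s")
```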
-
A new, interesting quantization scheme was published, which not only reduces memory consumption (like current quantization schemes) but also reduces computation.
> **[QuaRot: Outlier-Free 4-Bit In…
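To illustrate the core idea, here is a toy numpy sketch (not the paper's code; the quantize helper is a simplified symmetric per-tensor 4-bit stand-in): rotating weights and activations by an orthogonal Hadamard matrix leaves the matmul result unchanged while spreading activation outliers across dimensions, so low-bit quantization of the activations loses much less information.
```python
# Toy illustration of the rotation idea: (W Q)(Q^T x) = W x exactly, but the
# rotated activations Q^T x have no extreme outlier channels, so simple
# symmetric 4-bit quantization introduces far less error.
import numpy as np
from scipy.linalg import hadamard

d = 256
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
x = rng.normal(size=d)
x[::32] *= 50.0                      # inject a few large outlier channels

Q = hadamard(d) / np.sqrt(d)         # orthogonal Hadamard rotation

def quantize(v, bits=4):
    """Simplified symmetric per-tensor quantization to `bits` bits."""
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

# The exact output is unchanged by the rotation.
assert np.allclose(W @ x, (W @ Q) @ (Q.T @ x))

err_plain   = np.linalg.norm(W @ quantize(x) - W @ x)
err_rotated = np.linalg.norm((W @ Q) @ quantize(Q.T @ x) - W @ x)
print(f"4-bit error, no rotation:   {err_plain:.2f}")
print(f"4-bit error, with rotation: {err_rotated:.2f}")
```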
-
Starting vLLM with the FLASHINFER backend reported an error when enabling --quantization gptq and --kv-cache-dtype fp8_e5m2.
Start command:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 78…
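For reference, a minimal offline sketch that exercises the same combination outside the API server (the GPTQ checkpoint name is a placeholder, and selecting the backend via VLLM_ATTENTION_BACKEND is an assumption based on common vLLM usage, not taken from the original report):
```python
# Offline reproduction sketch: FlashInfer backend + GPTQ + fp8_e5m2 KV cache.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # select the FlashInfer backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",  # placeholder GPTQ model
    quantization="gptq",
    kv_cache_dtype="fp8_e5m2",
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```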
-
### Your current environment
```text
vllm-0.6.4.post1
```
### How would you like to use vllm
I am using the latest vLLM version, and I need to apply RoPE scaling to llama3.1-8b and gemma2-9b…
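A minimal sketch of passing a RoPE-scaling override through the offline LLM API (the scaling type, factor, and key names below are assumed examples, not recommended settings; depending on the vLLM version this may instead go through a separate rope_scaling engine argument / --rope-scaling CLI flag):
```python
# Sketch: override rope_scaling in the model config via hf_overrides.
# The dict below is an assumed example; the expected key ("rope_type" vs. the
# older "type") depends on the vLLM/transformers versions in use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    hf_overrides={"rope_scaling": {"rope_type": "dynamic", "factor": 2.0}},
    max_model_len=16384,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```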
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N…
-
### How would you like to use vllm
I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible 0.5.4 vLLM server on a HuggingFace Inference End…
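For the parallel-call side, a sketch of fanning out concurrent requests with the async OpenAI client against an OpenAI-compatible endpoint (base URL, API key, model name, and image URLs are placeholders for the reporter's setup):
```python
# Fan out concurrent vision requests to an OpenAI-compatible vLLM endpoint.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholders

async def describe(image_url: str) -> str:
    resp = await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    urls = [f"https://example.com/img_{i}.jpg" for i in range(8)]  # placeholder URLs
    results = await asyncio.gather(*(describe(u) for u in urls))
    print(results)

asyncio.run(main())
```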
-
Can anyone help me with these questions?
1) When I launch the OpenAI-compatible vLLM server `python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --max-model-len 327…
-
### Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capab…
-
### Name and Version
./llama-cli --version
version: 4077 (af148c93)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
### Operating systems
Mac
### W…
-
### System Info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|--------------------…