-
### Your current environment
```
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine 0.5.3 /workspace/home/lich/aphrodite-eng…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N…
-
### Proposal to improve performance
Test the new Medusa speculative sampling feature with [vllm v0.5.2](vllm-openai:v0.5.2).
After using Medusa speculative sampling, the performance dropped significantl…
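A minimal sketch of how Medusa speculative decoding might be enabled in vLLM v0.5.x for this kind of test; the model path, Medusa checkpoint, and tuning values are placeholders (the report does not name them), and exact argument names may differ between releases:

```python
from vllm import LLM, SamplingParams

# Placeholders: the original report does not name the target model or the Medusa head checkpoint.
llm = LLM(
    model="/path/to/target-model",              # base model being accelerated (placeholder)
    speculative_model="/path/to/medusa-heads",  # Medusa head checkpoint (placeholder)
    num_speculative_tokens=5,                   # tokens proposed per step; tune per workload
    use_v2_block_manager=True,                  # speculative decoding in v0.5.x expects the v2 block manager
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Comparing throughput of this configuration against the same `LLM` without the `speculative_*` arguments would isolate the regression the report describes.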
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC …
-
### Error
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
I would like to use features such as the Multi-Instance Support offered by the tensorrt-llm backend. In the documentation, I can see that multiple models are served using modes such as Leader mode and …
-
### OS
Linux
### GPU Library
CUDA 12.x
### Python version
3.11
### Describe the bug
When running exllamav2's inference_speculative.py example with Llama 3.1 8B 2.25bpw as the draft model and 70B 4.5bpw a…
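For context, a rough sketch of the speculative setup that example follows; model directories and sequence lengths here are placeholders, and constructor arguments may vary across exllamav2 versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Target model (placeholder path for a 70B 4.5bpw EXL2 quant)
config = ExLlamaV2Config("/models/llama-3.1-70b-4.5bpw-exl2")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)

# Draft model (placeholder path for an 8B 2.25bpw EXL2 quant)
draft_config = ExLlamaV2Config("/models/llama-3.1-8b-2.25bpw-exl2")
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, max_seq_len=8192, lazy=True)
draft_model.load_autosplit(draft_cache, progress=True)

# Dynamic generator with speculative decoding: the draft model proposes tokens,
# the target model verifies them.
generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    draft_model=draft_model,
    draft_cache=draft_cache,
    tokenizer=tokenizer,
)

print(generator.generate(prompt="Once upon a time,", max_new_tokens=200))
```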
-
### System Info
Hardware: L20
Version: 0.11.0.dev20240625
Model: Bloom7b1
### Who can help?
@ncomly-nvidia @byshiue
I have obtained the Medusa head for Bloom according to the official M…
-
# 🐛 Bug
from vllm import LLM, SamplingParams
llm = LLM(model=model_dir, enforce_eager=True)
then
```
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.…