-
### Your current environment
**Note: this is my host environment; the engine is running on the official latest Docker image.**
```text
Collecting environment information...
PyTorch version: N/A
Is debug …
-
# Progress
- [x] Implement TPU executor that works on a single TPU chip (without tensor parallelism) #5292
- [ ] Support tensor parallelism for multiple chips in the same host #5871
- [ ] Suppo…
-
### 🚀 The feature, motivation and pitch
Please consider adding support for GPTQ and AWQ quantized Mixtral models.
I guess that after #4012 it's technically possible.
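As a conceptual aside, here is a toy sketch of what weight-only low-bit quantization does (the general idea behind GPTQ/AWQ-style schemes, not their actual algorithms): weights are stored as small integers plus a scale and dequantized on the fly. The function names and the 4-bit grouping are purely illustrative.

```python
# Toy weight-only 4-bit quantization sketch (illustrative only; NOT the
# real GPTQ or AWQ algorithm). Weights become signed 4-bit integers plus
# one float scale, and are dequantized back to floats at use time.

def quantize_4bit(weights):
    # Signed 4-bit range is -8..7; pick a scale that maps the largest
    # magnitude weight near the top of that range.
    scale = max(abs(w) for w in weights) / 7
    qs = [max(-8, min(7, round(w / scale))) for w in weights]
    return qs, scale

def dequantize(qs, scale):
    # Reconstruct approximate float weights from the integers.
    return [q * scale for q in qs]

ws = [0.7, -0.35, 0.07, 0.0]
qs, scale = quantize_4bit(ws)
approx = dequantize(qs, scale)
```

The reconstruction error per weight is bounded by half the scale, which is why per-group scales (as real GPTQ/AWQ kernels use) matter for accuracy.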
### Alternatives
_No r…
-
Are there plans, or an existing way, to support a left-padded KV attention mask? I believe right padding can be supported with the `mha_fwd_kvcache` API via the `seqlens_k_` pointer, but will there be a similar optio…
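To illustrate the difference being asked about, here is a plain-Python sketch (no flash-attn involved, all names hypothetical) of the valid-KV masks under the two padding conventions. With right padding the valid positions form a prefix describable by a single per-sequence length; with left padding they form a suffix, which a length-only parameter like `seqlens_k_` cannot express without an extra offset.

```python
# Toy masks for a batch of KV buffers of size max_len, given true
# sequence lengths. True = valid KV position, False = padding.

def right_pad_mask(seqlens, max_len):
    # Right padding: valid positions are the prefix [0, n).
    return [[pos < n for pos in range(max_len)] for n in seqlens]

def left_pad_mask(seqlens, max_len):
    # Left padding: valid positions are the suffix [max_len - n, max_len).
    return [[pos >= max_len - n for pos in range(max_len)] for n in seqlens]
```

For example, a sequence of length 2 in a buffer of 4 yields `[True, True, False, False]` when right-padded but `[False, False, True, True]` when left-padded.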
-
Hi!
I don't quite understand how this project works; I guess my main question is: what is a draft model?
For example, I would like to speed-up the inference of OwlVit (https://huggingface.…
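For context on the question, here is a toy sketch of the idea behind a draft model in speculative decoding. Both "models" here are stand-in functions, not real APIs: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them in order, keeping the longest agreeing prefix.

```python
# Toy speculative decoding loop (illustrative only; all model functions
# are hypothetical stand-ins, not a real library API).

def draft_propose(prefix, k=4):
    # Cheap "draft model": guesses the next k tokens.
    out = [(prefix[-1] + i + 1) % 100 for i in range(k)]
    out[-1] = 42  # deliberately wrong last guess, to show rejection
    return out

def target_next_token(prefix):
    # Expensive "target model": the ground-truth next token.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    proposal = draft_propose(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        # Target verifies each drafted token; stop at the first mismatch.
        if target_next_token(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < k:
        # On rejection, still emit one token from the target model.
        accepted.append(target_next_token(ctx))
    return prefix + accepted

print(speculative_step([5]))  # → [5, 6, 7, 8, 9]
```

The speedup comes from the target model verifying k drafted tokens in one forward pass instead of generating them one by one; the output distribution matches the target model alone.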
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### 🚀 The feature, motivation and pitch
`MLPSpeculator`-based speculative decoding was recently added in https://github.com/vllm-project/vllm/pull/4947, but the initial integration only covers sing…
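For readers unfamiliar with the tensor-parallel side of this request, here is a toy pure-Python illustration of the core idea (shapes and function names are made up; real implementations such as vLLM's use `torch.distributed`): a weight matrix is sharded across workers along the output dimension, each worker computes its slice, and the slices are concatenated.

```python
# Toy tensor parallelism sketch (illustrative only, no real distributed
# runtime): shard a mat-vec product across "workers" by output rows.

def matvec(mat, vec):
    # Dense y = W x with W given as a list of rows.
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def sharded_matvec(row_shards, vec):
    # Each shard is the subset of W's rows owned by one worker; the full
    # output is the concatenation of the per-worker partial outputs.
    out = []
    for shard in row_shards:
        out.extend(matvec(shard, vec))
    return out
```

Splitting `W` into `[W[:2], W[2:]]` and concatenating the shard outputs reproduces the unsharded result exactly, which is why the speculator weights also need a sharding scheme before multi-GPU serving can work.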
-
### Your current environment
```text
Collecting environment information...
/home/daniel/.pyenv/versions/vllm/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRAN…
-
Using nightly wheels. I can serve just fine with `--speculative-mode disable`, but all the other options give me this:
```text
Exception in thread Thread-11 (_background_loop):
Traceback (most recent …
-
### Your current environment
Using the latest available Docker image: vllm/vllm-openai:v0.5.0.post1
### 🐛 Describe the bug
I am getting "Internal Server Error" as the response when calling the /v1/embedd…