-
Thanks for the great work, team. I wonder if there are any plans to add new improvements to speculative decoding such as [Eagle](https://sites.google.com/view/eagle-llm), [Medusa](https://sites.google.co…
-
## Background
Speculative decoding leverages the ability to cheaply generate proposals and to cheaply verify them, achieving speedups for memory-bound inference. Different methods of speculative decodin…
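The propose-then-verify loop can be sketched in a few lines. This is a hypothetical greedy-only toy, not any particular implementation: `draft_next` and `target_next` stand in for a small draft model and a large target model, operating on lists of integer token ids.

```python
# Minimal sketch of speculative decoding (greedy case, toy models).
# Real systems verify all k positions in ONE batched target forward pass;
# the loop below just makes the acceptance rule explicit.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, then keep the longest prefix the target agrees with."""
    # 1. Cheap proposal: the draft model autoregressively guesses k tokens.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Verification: accept draft tokens until the first disagreement,
    #    substituting the target's own token at the mismatch.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        expect = target_next(ctx)
        if expect != t:
            accepted.append(expect)  # target's correction, then stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # bonus token when all k accepted

    return accepted

# Toy models: the target counts up by 1; the draft agrees except after
# tokens divisible by 3, where it overshoots.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + 1 if ctx[-1] % 3 else ctx[-1] + 2
```

Every call to `speculative_step` emits at least one correct token (the target's correction), so the output always matches plain autoregressive decoding; the win comes when several draft tokens are accepted per target pass.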
-
### 🚀 The feature, motivation and pitch
Hi,
Do you have any workaround for the `Speculative decoding not yet supported for RayGPU backend.` error, or an idea of when the RayGPU backend will support …
-
Recently a project called Medusa was released. It trains additional `lm_head`s that, instead of predicting the next token, predict tokens n+2, n+3, and n+4 before generatin…
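The core idea can be sketched as follows. This is a hedged toy, not the Medusa implementation: the tiny linear "heads" below stand in for the small trained projections Medusa adds on top of the last hidden state, so one forward pass yields a multi-token draft to verify.

```python
# Hypothetical sketch of the Medusa idea: extra heads read the SAME hidden
# state as the normal lm_head but predict tokens further ahead, producing a
# multi-token proposal from a single forward pass.

def argmax_head(hidden, weights):
    """Greedy pick over a toy linear head: logits[v] = dot(hidden, weights[v])."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    return max(range(len(logits)), key=logits.__getitem__)

def medusa_propose(hidden, heads):
    """Head i predicts the token i+1 positions ahead of the current one."""
    return [argmax_head(hidden, W) for W in heads]
```

In the real method the proposals from several heads are combined (e.g. as a tree of candidates) and verified by the base model in one pass, much like other speculative schemes.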
-
### Your current environment
vllm-0.4.3
### 🐛 Describe the bug
When I use speculative mode and `prompt_length + output_length > 2048`, the error occurs.
When I use speculative mode, I use th…
-
Transformers 4.35 only supports speculative decoding for batch size == 1. To use speculative decoding with batch size > 1, please make sure to use this branch: https://github.com/huggingface/t…
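Part of why batch size > 1 is harder: each sequence in the batch may accept a different number of draft tokens, so the accepted lengths are ragged and the batch falls out of lockstep. A minimal, hypothetical sketch of that per-row bookkeeping (names are illustrative, not the transformers API):

```python
# For each sequence in the batch, count how many greedy draft tokens the
# target model agrees with; rows typically accept DIFFERENT amounts, which
# is what complicates batched speculative decoding.

def accepted_lengths(proposals, target_preds):
    """Length of the agreeing prefix per sequence in the batch."""
    out = []
    for prop, tgt in zip(proposals, target_preds):
        n = 0
        while n < len(prop) and prop[n] == tgt[n]:
            n += 1
        out.append(n)
    return out
```

After this step an implementation must re-pad or re-pack the batch so all sequences line up again for the next iteration.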
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
Hi, thank you so much for your awesome work!
I noticed that when running `equal.py` to compare the decoded tokens of speculative decoding methods (pld/eagle/hydra) with vanilla decoding tokens, the …
-
### 🚀 The feature, motivation and pitch
[Parallel/Jacobi decoding](https://arxiv.org/abs/2305.10427) improves inference efficiency by breaking the sequential nature of conventional auto-regressive …
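The fixed-point view behind Jacobi decoding can be shown on a toy greedy "model". This is a hedged sketch, not the paper's algorithm: start from an arbitrary guess for all n output tokens, refine every position in parallel from the current guess, and stop at a fixed point, which by construction equals the autoregressive answer.

```python
# Jacobi (parallel) decoding sketch: each sweep updates ALL positions from
# the previous guess at once. Position i becomes correct by sweep i+1, so
# at most n sweeps are needed, often far fewer.

def jacobi_decode(prompt, next_token, n):
    tokens = [0] * n  # arbitrary initial guess for all n outputs
    for _ in range(n):
        # One parallel sweep: position i is re-predicted from the prompt
        # plus the CURRENT guess for positions 0..i-1.
        new = [next_token(prompt + tokens[:i]) for i in range(n)]
        if new == tokens:  # fixed point reached: matches greedy decoding
            break
        tokens = new
    return tokens

# Toy "model": next token is the sum of the context, mod 7.
next_token = lambda ctx: sum(ctx) % 7
```

In a real LLM one sweep is a single forward pass over all n positions, so early convergence turns n sequential passes into far fewer parallel ones.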
-
This paper might be of interest: https://arxiv.org/pdf/2402.12374.pdf