-
We're excited to present in this roadmap document the features we're currently working on and planning to support. Your feedback is highly valued, so please don't hesitate to comment or reach out if y…
-
I want to know how we can run speculative decoding (assisted generation) to increase tokens/sec for a Llama 2-based model with optimum.neuron on inf2. Similar to what transformers have done f…
-
### 🚀 The feature, motivation and pitch
**TLDR:** We would like an option we can enable to continuously stream the `UsageInfo` when using the streaming completions API. This solves a number of "accou…
-
Speculative sampling is a technique for accelerating autoregressive text generation. It involves generating multiple possible outputs or continuations of a sequence using a cheap draft model, and t…
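A minimal sketch of the draft-then-verify loop described above, assuming toy `draft_probs`/`target_probs` distributions over a tiny vocabulary as hypothetical stand-ins for real model logits:

```python
# Sketch of speculative sampling: a cheap draft model proposes k tokens,
# the target model verifies each with an accept/reject rule that preserves
# the target distribution. Toy distributions stand in for real models.
import random

VOCAB = list(range(8))

def draft_probs(seq):
    # hypothetical cheap draft distribution: uniform over the vocabulary
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_probs(seq):
    # hypothetical target distribution: favors token (last + 1) mod |V|
    favored = (seq[-1] + 1) % len(VOCAB) if seq else 0
    p = {t: 0.5 / (len(VOCAB) - 1) for t in VOCAB}
    p[favored] = 0.5
    return p

def sample(probs):
    r = random.random()
    acc = 0.0
    for t, p in probs.items():
        acc += p
        if r <= acc:
            return t
    return t  # float-rounding fallback

def speculative_step(seq, k=4):
    """Draft k tokens, then accept/reject each against the target model."""
    draft, s = [], list(seq)
    for _ in range(k):
        t = sample(draft_probs(s))
        draft.append(t)
        s.append(t)
    accepted, s = [], list(seq)
    for t in draft:
        q = draft_probs(s)[t]
        p = target_probs(s)[t]
        if random.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(t)
            s.append(t)
        else:
            # on rejection, resample from the normalized residual max(p - q, 0)
            resid = {v: max(target_probs(s)[v] - draft_probs(s)[v], 0.0)
                     for v in VOCAB}
            z = sum(resid.values()) or 1.0
            accepted.append(sample({v: r / z for v, r in resid.items()}))
            break
    # (a full implementation also samples one extra token from the target
    #  when every draft token is accepted)
    return accepted
```

Each call returns between 1 and k tokens, so a single target-model verification pass can emit several tokens at once.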
-
Hi,
Thanks for this great repo. Since the main branch has vLLM 0.5.0, I was wondering whether speculative decoding will also be supported in sglang. Right now I am getting the following error:
laun…
-
Hi, thanks for your awesome work!
I'm trying to implement https://github.com/SafeAILab/EAGLE with high-performance kernels. I read [this blog](https://flashinfer.ai/2024/02/02/introduce-flashinfer.…
-
Prompt lookup decoding (PLD) is a variant of speculative decoding that replaces the draft model with a prefix lookup in the current sequence, resulting in a 2-4x throughput boost for input-grounded ta…
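A minimal sketch of the prefix-lookup draft step PLD relies on; the n-gram size and draft length here are illustrative parameters, not the reference implementation's defaults:

```python
def prompt_lookup_draft(input_ids, ngram_size=3, num_draft=5):
    """Draft tokens by matching the trailing n-gram earlier in the
    sequence and copying the tokens that followed it, so no draft
    model is needed. Works well for input-grounded tasks where the
    output largely repeats spans of the prompt."""
    if len(input_ids) < ngram_size:
        return []
    tail = input_ids[-ngram_size:]
    # search backwards, excluding the trailing occurrence itself
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == tail:
            return input_ids[start + ngram_size:start + ngram_size + num_draft]
    return []
```

The drafted tokens are then verified in a single forward pass of the target model, exactly as with a model-based draft.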
-
Hello all,
Thanks for your great work here. We are implementing speculative decoding at mistral.rs, and were in the final stages of testing when we discovered some incredibly strange behavior. Spec…
-
### Some context
I am using AMD MI100 GPUs and I can get ~33 tokens/second for Llama 2 70B using
- compile
- tensor parallelism of 8
- int8 quantization
```
time torchrun --standalone --npr…
```
-
Tree attention mask is already supported in huggingface/transformers: https://github.com/huggingface/transformers/pull/27539
It will be very helpful for the speculative decoding applications. More se…
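As an illustration of the kind of mask involved (a sketch, not the transformers implementation), a tree attention mask can be built from a parent array so that each speculated token attends only to itself and its ancestors, keeping sibling branches of the speculation tree isolated:

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a boolean attention mask for a token tree.

    `parents[i]` is the index of node i's parent (-1 for a root).
    Node i may attend to node j iff j is i or an ancestor of i.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask
```

For example, with `parents = [-1, 0, 0, 1]` node 3 attends to nodes 0, 1, and 3, while node 2 (a sibling branch) cannot see node 1.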