-
### What behavior of the library made you think about the improvement?
As of now, Medusa is generating hallucinations because the speculative multi-head does not respect the Outlines (guided) decoding grammar.
…
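A minimal repro sketch, assuming a vLLM build where Medusa is enabled through `speculative_model` and grammar constraints through `GuidedDecodingParams`; argument names vary across vLLM versions, and the model paths below are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Placeholder paths; substitute your target model and its trained Medusa heads.
llm = LLM(
    model="<target-model>",
    speculative_model="<medusa-heads-checkpoint>",
    num_speculative_tokens=5,
)

# Grammar-constrained request: with the Medusa path active, the proposed
# draft tokens are reportedly not filtered by this grammar, which is where
# the hallucinated (out-of-grammar) output shows up.
params = SamplingParams(
    temperature=0.0,
    guided_decoding=GuidedDecodingParams(json={"type": "object"}),
)
print(llm.generate(["Return an empty JSON object."], params)[0].outputs[0].text)
```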
-
### 🚀 The feature, motivation and pitch
Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer from the throughput-optimized default scheduling strategy where prefills are…
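One existing scheduler knob that mitigates this is chunked prefill, which lets in-flight decode steps share a batch with partial prefills instead of waiting behind them; a sketch, treating the engine-argument names as assumptions about the current vLLM API:

```python
from vllm import LLM

# With chunked prefill, long prompts are split into chunks so that decode
# steps (including speculative ones) are not stalled behind full prefills.
llm = LLM(
    model="facebook/opt-6.7b",
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # token budget shared by prefill chunks and decodes
)
```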
-
How do I use speculative decoding?
Is there documentation for understanding it better?
Support was added in a recent update for both TensorRT-LLM and the TensorRT-LLM backend.
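Not a TensorRT-LLM example, but if it helps build intuition while the docs catch up: Hugging Face transformers exposes the same draft/verify idea as "assisted generation" behind a single `generate` argument. A minimal sketch (the model choices are just examples):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tok("Speculative decoding works by", return_tensors="pt")
# The draft model proposes a few tokens per step; the target model verifies
# them in a single forward pass and keeps the longest accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```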
-
### 🚀 The feature, motivation and pitch
I want to implement tree attention for vLLM, as mentioned in the [RoadMap](https://github.com/vllm-project/vllm/issues/3861). But I don’t know whether I should imple…
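For reference, the central data structure is compact: in tree attention, each candidate token attends only to itself and its ancestors in the draft tree. A self-contained sketch of building that mask from parent indices (illustrative only, not vLLM internals):

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Boolean [n, n] mask: mask[i, j] is True iff node j is node i
    itself or an ancestor of node i. parents[i] is -1 for a root."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up the tree to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Two candidate branches after a shared root token:
#   0 -> 1 -> 3   and   0 -> 2 -> 4
print(tree_attention_mask([-1, 0, 0, 1, 2]).int())
```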
-
Hello there,
I was wondering whether it would be possible to have self-speculative decoding operate using IQ2 as the draft model and FP8 as the core model (as it has been shown that FP8 is very rarely …
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
- [ ] [self-speculative-decoding/README.md at main · dilab-zju/self-speculative-decoding](https://github.com/dilab-zju/self-speculative-decoding/blob/main/README.md?plain=1)
# Self-Speculative Decod…
-
https://github.com/dust-tt/llama-ssp
Any plans to implement speculative decoding? It would probably improve latency by at least 2x and seems not too difficult to implement.
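For context, the heart of the method (as in the speculative sampling papers this repo draws on) is a per-token accept/reject test that leaves the target model's output distribution unchanged; a sketch of that single step:

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(token: int, p_target: np.ndarray, q_draft: np.ndarray):
    """Accept a draft token with probability min(1, p/q); on rejection,
    resample from the residual distribution max(p - q, 0), normalized.
    Returns (accepted, token_to_emit)."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return True, token
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return False, rng.choice(len(residual), p=residual)
```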
-
This could be used for LLMs and, hopefully, for encoder-decoder models as well, e.g., a smaller NLLB model coupled with a bigger NLLB model; see the sketch below.
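A sketch of that NLLB pairing via transformers' assisted generation, assuming the feature accepts this encoder-decoder pair (the two checkpoints share a tokenizer, which the draft/target setup requires):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
target = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")
draft = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tok("Speculative decoding may speed up translation too.", return_tensors="pt")
out = target.generate(
    **inputs,
    assistant_model=draft,  # small NLLB proposes, big NLLB verifies
    forced_bos_token_id=tok.convert_tokens_to_ids("fra_Latn"),  # target language
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```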
-
"We tested the speculative inference using the first 100 inputs from alpaca test dataset as prompts. When model=gpt2-xl, draft_model=gpt2".
I want to test speedup for my own model and draft_model. …
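Not this repo's script, but as an independent cross-check, one way to measure the speedup on your own pair is to time `generate` with and without the draft model; a sketch using transformers' assisted generation with the same gpt2-xl/gpt2 pairing (swap in your checkpoints):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, DRAFT = "gpt2-xl", "gpt2"   # replace with your model / draft_model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
draft = AutoModelForCausalLM.from_pretrained(DRAFT).eval()
inputs = tok("Below is an instruction that describes a task.", return_tensors="pt")

def bench(**kw):
    """Greedy-decode 128 new tokens and return tokens per second."""
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False, **kw)
    return (out.shape[1] - inputs.input_ids.shape[1]) / (time.perf_counter() - start)

print(f"baseline:    {bench():.1f} tok/s")
print(f"speculative: {bench(assistant_model=draft):.1f} tok/s")
```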