-
I've tested the speculative decoding feature using llama3 models: I converted the draft/target models to TRT engines and launched the Triton server with the BLS model, but there seems to be no performance gain.
environment s…
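For reference, a minimal timing-harness sketch to quantify the (missing) gain, assuming the server exposes Triton's HTTP generate endpoint for a model named `tensorrt_llm_bls`; the endpoint path, model name, and payload fields (including `num_draft_tokens`) are assumptions based on a typical Triton + TensorRT-LLM BLS deployment and may not match this setup.

```python
import time
import requests

# Assumed Triton generate endpoint for the BLS model; adjust host and model name as needed.
URL = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"
PROMPT = "Explain speculative decoding in one paragraph."

def mean_latency(payload, n_runs=5):
    """Send the same request n_runs times and return the mean latency in seconds."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        resp = requests.post(URL, json=payload, timeout=300)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

# Field names are assumptions; they depend on the deployment's config.pbtxt.
baseline = {"text_input": PROMPT, "max_tokens": 256}
speculative = {"text_input": PROMPT, "max_tokens": 256, "num_draft_tokens": 5}

print("baseline   :", mean_latency(baseline))
print("speculative:", mean_latency(speculative))
```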
-
### Motivation
Speculative decoding can speed up generation by more than 2x. This degree of speedup is an important feature for a production-grade LM deployment library, and it seems the methods are s…
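To make the mechanism concrete, here is a minimal greedy draft-and-verify sketch (not the rejection-sampling variant production systems use). The checkpoint paths are placeholders, and the loop omits KV caching, which is where real implementations recover most of the speedup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: any draft/target pair that shares a tokenizer.
DRAFT_PATH = "path/to/draft-model"
TARGET_PATH = "path/to/target-model"

tok = AutoTokenizer.from_pretrained(TARGET_PATH)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_PATH).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET_PATH).eval()

@torch.no_grad()
def speculative_step(input_ids, k=5):
    """Propose k tokens with the draft model, then verify them in one target forward pass."""
    # 1) Draft k tokens greedily (no KV cache here, for brevity).
    proposal = input_ids
    for _ in range(k):
        next_tok = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2) Verify: the target scores every proposed position in a single forward pass.
    target_logits = target(proposal).logits
    n_prompt = input_ids.shape[1]
    target_choice = target_logits[:, n_prompt - 1:-1].argmax(-1)  # target's greedy picks
    drafted = proposal[:, n_prompt:]

    # 3) Accept the longest prefix where draft and target agree.
    agree = (target_choice == drafted)[0].long()
    n_accept = int(agree.cumprod(0).sum())
    accepted = drafted[:, :n_accept]

    # 4) One "free" token from the target: its prediction right after the accepted prefix.
    bonus = target_logits[:, n_prompt - 1 + n_accept].argmax(-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)

prompt = tok("Speculative decoding works by", return_tensors="pt").input_ids
out = prompt
for _ in range(8):  # a few decoding steps
    out = speculative_step(out)
print(tok.decode(out[0], skip_special_tokens=True))
```

With greedy verification as above, the output matches what the target model alone would produce; the speedup comes from the target only needing one forward pass per k-token proposal.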
-
### Proposal to improve performance
With the end-to-end correctness tests merged in https://github.com/vllm-project/vllm/pull/3951, we can now optimize the implementation to get a ~50% speedup on 70…
-
### System Info
Transformers Version: 4.42.0
Python environment: 3.10.14
### Who can help?
@sanchit-gandhi
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### …
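Not the reporter's script, but for context: speculative (assisted) generation in Transformers is typically invoked via the `assistant_model` argument of `generate`. A minimal sketch with placeholder checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: replace with the actual target and assistant (draft) models.
TARGET = "path/to/target-model"
ASSISTANT = "path/to/assistant-model"

tok = AutoTokenizer.from_pretrained(TARGET)
model = AutoModelForCausalLM.from_pretrained(TARGET, torch_dtype=torch.float16, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(ASSISTANT, torch_dtype=torch.float16, device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
# Assisted generation: the assistant drafts tokens, the main model verifies them.
out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```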
-
### Motivation.
Speculative Decoding is a crucial feature for reducing latency, currently supported by vLLM (credit to @cadedaniel!). However, when deploying Speculative Decoding in real online LL…
-
-
Good morning (or afternoon/evening)!
Among the techniques for speeding up LLM inference, there is a method called **self speculative decoding**. Would it be possible to implement this …
-
### Your current environment
docker with vllm/vllm-openai:v0.4.3 (latest)
### 🐛 Describe the bug
python3 -m vllm.entrypoints.openai.api_server --model ./Qwen1.5-72B-Chat/ --max-model-len 2400…
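If it helps with reproduction, a minimal client against vLLM's OpenAI-compatible endpoint might look like the sketch below; the served model name is assumed to mirror the `--model` path from the launch command, and the port is the default.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the default local port is 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="./Qwen1.5-72B-Chat/",   # assumed to match the --model argument above
    prompt="Write a haiku about autumn.",
    max_tokens=128,
)
print(resp.choices[0].text)
```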
-
Hi, I just tried to use this custom all-reduce kernel for speculative decoding. I set ENABLE_INTRA_NODE_COMM=1, but I found that the code gets stuck after several iterations. Are there any bugs in this kern…
-
### Proposal to improve performance
In https://github.com/vllm-project/vllm/pull/3951 we disabled bonus tokens (the token sampled from the verifier model assuming all proposal tokens are accepted) because i…
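For readers unfamiliar with the term: when every proposed token is accepted, the verifier's single forward pass has already produced a distribution for the next position, so one extra ("bonus") token can be emitted for free. A toy illustration, independent of vLLM's actual implementation:

```python
import torch

k = 3                                   # number of proposal tokens per step
proposal = torch.tensor([11, 12, 13])   # tokens suggested by the draft model
# Verifier's greedy choices for positions 1..k+1, obtained in a single forward pass.
verifier_choice = torch.tensor([11, 12, 13, 14])

# Accept the longest prefix on which the verifier agrees with the proposal.
agree = (verifier_choice[:k] == proposal).long()
n_accept = int(agree.cumprod(0).sum())

emitted = proposal[:n_accept].tolist()
if n_accept == k:
    # All proposals accepted: the verifier's prediction for position k+1 is the bonus token.
    emitted.append(int(verifier_choice[k]))
print(emitted)   # [11, 12, 13, 14] -> four tokens emitted from one verifier pass
```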