[Feature] Speculative Decoding

josephrocca commented 3 weeks ago

Motivation

Speculative decoding can speed up generation by more than 2x. A speedup of this size is an important feature for a production-grade LM deployment library, and the methods seem to be maturing enough to make their way into frameworks like TGI and vLLM, so it might be a good time for LMDeploy to consider adding support for a popular, established speculative decoding method.

Related resources

TGI (supports Medusa and MLPSpeculator as of writing):
- https://huggingface.co/docs/text-generation-inference/basic_tutorials/train_medusa
- https://github.com/huggingface/text-generation-inference/pull/1865
vLLM (groundwork for several speculation methods in progress as of writing):

Below is a copy-paste from a neat project called Spec-Bench. The ranking when running 33B models is similar. Please see the linked repo for latest data.

Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
Testing environment: Pytorch 2.0.1, under CUDA 11.8
Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
|---|---|---|---|---|---|---|---|---|
| EAGLE🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |

Note that MLPSpeculator is not included in the benchmark since it is newer. Another new method that isn't included in Spec-Bench as of writing:

https://github.com/apple/ml-recurrent-drafter

zhyncs commented 3 weeks ago

In fact, we have already implemented the Medusa TreeMask version in LMDeploy. When batch=1, the acceleration ratio and RPS improvement relative to the main branch are consistent with those in the blog.

However, as the batch size increases, the overhead of Medusa prefill outweighs the benefit of generating multiple tokens per iteration. We are currently working on solving this problem. Please stay tuned.

zhyncs commented 3 weeks ago

EAGLE also has plans to support open source in the future.

Dbxwz commented 2 weeks ago

@zhyncs I implemented EAGLE in vLLM and hit the same problem when the batch size increases. Here is a simple analysis (bs is the batch size, k is the proposal length, and the batch size bottleneck of the target model is 3):

[image: spec_decode]

Because computing tokens that are later rejected wastes GPU resources, skipping speculative decoding is sometimes the best choice.
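To make the batch-size effect concrete, here is a toy cost model in Python. It is only a sketch under assumptions that are not from the analysis image: each drafted token is accepted independently with probability `alpha`, a target forward pass costs the same up to `bottleneck` tokens and then grows linearly, and the draft model is free. The function names are made up for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens.

    Assumes every drafted token is accepted independently with probability
    alpha; the step always yields at least one token from the target model.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_step_cost(bs: int, k: int, bottleneck: int) -> float:
    """Cost of a verification pass relative to a plain decoding step.

    Toy assumption: a forward pass has constant cost up to `bottleneck`
    tokens, then scales linearly with the number of tokens processed.
    """
    spec_cost = max(1.0, bs * (k + 1) / bottleneck)  # verifies bs * (k + 1) tokens
    base_cost = max(1.0, bs / bottleneck)            # plain decoding: bs tokens
    return spec_cost / base_cost


def estimated_speedup(bs: int, k: int, alpha: float, bottleneck: int = 3) -> float:
    """Estimated throughput ratio of speculative vs. plain decoding.

    Ignores the cost of the draft model, so it is an optimistic estimate.
    """
    return expected_tokens_per_step(alpha, k) / relative_step_cost(bs, k, bottleneck)
```

Under these assumptions, `estimated_speedup(1, 2, 0.8)` is above 1 while `estimated_speedup(4, 2, 0.8)` falls below 1, which matches the observation that speculation stops paying off once the batch fills the GPU.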

Meituan's solution introduces a novel sampling mechanism that leverages Thompson Sampling to regulate the generation process. Someone else uses a trained control module (I forgot the source).

Or, similar to vLLM's current approach, we can simply skip speculative decoding when the batch size exceeds a certain threshold. It's simple and effective, and additional conditions can be added later for future enhancements.
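A minimal sketch of such a gate in Python, with room for extra conditions. This is not vLLM's or LMDeploy's actual code; `SpecGate`, `max_spec_batch_size`, and `extra_conditions` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SpecGate:
    """Decides how many tokens to draft for the current batch (hypothetical)."""

    k: int                    # proposal length when speculation is enabled
    max_spec_batch_size: int  # skip speculation above this batch size
    # Optional extra checks; each receives the batch size and returns
    # True if speculation is still allowed.
    extra_conditions: List[Callable[[int], bool]] = field(default_factory=list)

    def proposal_length(self, batch_size: int) -> int:
        """Return k, or 0 to fall back to plain decoding for this step."""
        if batch_size > self.max_spec_batch_size:
            return 0
        if not all(cond(batch_size) for cond in self.extra_conditions):
            return 0
        return self.k
```

For example, `SpecGate(k=4, max_spec_batch_size=8).proposal_length(16)` returns 0, so the scheduler would run a normal decoding step for that batch.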

zhyncs commented 2 weeks ago

> we can simply skip speculative decoding when the batch size exceeds a certain threshold

Thank you for sharing. In fact, this is how we currently do it internally as well, but the approach is still a bit rough. If we want speculative decoding to be enabled by default without users having to think about it, we also need to adjust the threshold dynamically based on the actual workload, which introduces a certain level of complexity.
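One possible shape for dynamic adjustment, sketched in Python. This is an assumption for illustration, not the internal implementation: it keeps an exponential moving average of measured tokens per second for each batch size and mode, and moves the threshold one step at a time. `AdaptiveThreshold` and its methods are hypothetical names.

```python
from typing import Dict, Tuple


class AdaptiveThreshold:
    """Adjusts the batch-size threshold from measured throughput (hypothetical)."""

    def __init__(self, initial_threshold: int, smoothing: float = 0.2):
        self.threshold = initial_threshold
        self.smoothing = smoothing
        # (batch_size, speculative) -> smoothed tokens per second
        self._tps: Dict[Tuple[int, bool], float] = {}

    def record(self, batch_size: int, speculative: bool,
               tokens: int, seconds: float) -> None:
        """Record one measurement and update the threshold if needed."""
        key = (batch_size, speculative)
        rate = tokens / seconds
        old = self._tps.get(key)
        self._tps[key] = rate if old is None else (
            (1 - self.smoothing) * old + self.smoothing * rate)
        self._update(batch_size)

    def _update(self, batch_size: int) -> None:
        spec = self._tps.get((batch_size, True))
        plain = self._tps.get((batch_size, False))
        if spec is None or plain is None:
            return  # need measurements from both modes to compare
        if spec < plain and batch_size <= self.threshold:
            # Speculation loses at an enabled batch size: lower the threshold.
            self.threshold = batch_size - 1
        elif spec >= plain and batch_size == self.threshold + 1:
            # Speculation wins just above the threshold: raise it by one.
            self.threshold = batch_size

    def use_speculation(self, batch_size: int) -> bool:
        return batch_size <= self.threshold
```

A scheme like this only works if both modes are occasionally measured at batch sizes near the threshold, which is part of the complexity mentioned above.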

In actual usage, the acceptance rate of EAGLE is slightly higher than that of Medusa.

The Thompson Sampling control mechanism is currently not implemented in an actual production environment.

coolhok commented 2 weeks ago

> EAGLE also has plans to support open source in the future.

Can you share the schedule, or share the development branch so we can work on it together? Thanks!!

InternLM / lmdeploy

[Feature] Speculative Decoding #1738
