NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Feature: Speculative sampling / Assisted Generation #169

Open michaelfeil opened 11 months ago

michaelfeil commented 11 months ago

An obvious feature to me, though not one that is simple to implement: is speculative sampling on the roadmap?

The idea would be to use a second, tiny draft model whose proposed tokens are greedily validated by the main model. For more information: https://huggingface.co/blog/assisted-generation

Example models for Speculative sampling: https://huggingface.co/bigcode/tiny_starcoder_py
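To make it concrete, here is a rough sketch of what this looks like with the transformers API (assuming a transformers version that supports the `assistant_model` argument to `generate`; the model names are just the examples above):

```python
# Rough sketch of assisted generation in Hugging Face transformers:
# a tiny draft model proposes tokens, the main model validates them.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained("bigcode/tiny_starcoder_py").to(model.device)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# With greedy decoding, the output is identical to running the main
# model alone; the draft model only accelerates token generation.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```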

Related frameworks: https://github.com/huggingface/text-generation-inference/issues/1169

anjalibshah commented 11 months ago

Yes, it's on the roadmap.

ncomly-nvidia commented 11 months ago

Hi @michaelfeil, speculative decoding is on our roadmap. Are you looking at a draft model or self-speculative decoding? Or something else?

Also, we will soon publish a rough set of the requests we've received along with a roadmap, so stay tuned!

ywran commented 11 months ago

Will speculative decoding support the Python runtime?

jdemouth-nvidia commented 11 months ago

Hi @ywran ,

The support will be added to the C++ runtime first. We are also taking the question of a Python runtime seriously and are evaluating the best approach to offering a Python binding of our C++ runtime. We do not have a concrete timeline for now, but we will keep everyone updated as we make progress.

Thanks, Julien

michaelfeil commented 10 months ago

> Hi @michaelfeil, speculative decoding is on our roadmap. Are you looking at a draft model or self-speculative decoding? Or something else?
>
> Also, we will soon publish a rough set of the requests we've received along with a roadmap, so stay tuned!

I am primarily looking into speeding up StarCoder (15B) with "old style" assisted generation, e.g. with https://huggingface.co/bigcode/tiny_starcoder_py as the draft model. If you have additional ideas, I am open to discussing them.
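For reference, the greedy draft-and-verify step behind "old style" assisted generation looks roughly like this (a framework-agnostic sketch assuming Hugging Face-style causal LMs, batch size 1, and no KV cache):

```python
import torch

@torch.no_grad()
def assisted_step(target, draft, input_ids, k=5):
    """One round of greedy assisted generation: the draft model proposes
    k tokens autoregressively, then the target model validates all of
    them in a single forward pass."""
    # 1. Cheap draft model proposes k tokens greedily.
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1:].argmax(-1)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2. Expensive target model scores the whole proposal in one pass.
    target_preds = target(draft_ids).logits.argmax(-1)

    # 3. Accept the longest prefix where the target's greedy choice agrees
    #    with the draft; the first disagreement is replaced by the target's
    #    own token, so the result matches pure greedy decoding exactly.
    n = input_ids.shape[1]
    proposed = draft_ids[:, n:]                    # draft tokens at positions n..n+k-1
    expected = target_preds[:, n - 1 : n - 1 + k]  # target's choice for those positions
    n_match = (proposed == expected).int().cumprod(-1).sum().item()
    return torch.cat([input_ids, expected[:, : n_match + 1]], dim=-1)
```

A production implementation would additionally reuse KV caches across rounds and accept the target's bonus token when all k proposals match.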

jFkd1 commented 10 months ago

Looks like vLLM is close to having this feature implemented: https://github.com/vllm-project/vllm/pull/1679. Any news for TensorRT-LLM?

ncomly-nvidia commented 10 months ago

We are making the required improvements to the MHA kernels right now and are looking at a few different techniques for speculative decoding. Keep an eye out over the next few weeks for more details.

Dev-hestabit commented 10 months ago

Hey @ncomly-nvidia, can you please tell us when we will be able to test assisted generation on TensorRT-LLM?

ncomly-nvidia commented 10 months ago

Hey @Dev-hestabit. Our goal is to have a functional preview of speculative decoding in the next release (<1 month). We'll be sure to mention it in the discussions when it is added to main, and in the release notes once it is officially included.

shannonphu commented 9 months ago

@ncomly-nvidia Which models are planned to support speculative decoding?

ncomly-nvidia commented 9 months ago

Hi @shannonphu, we are starting with Llama variants. Which models are you interested in?

shannonphu commented 9 months ago

@ncomly-nvidia I am interested in encoder-decoder models like T5/FLAN-T5. I am not sure if it's possible to do speculative decoding on enc-dec models though :)

MrD005 commented 9 months ago

@ncomly-nvidia I have been going through the tensorrtllm_backend commits, and four days ago there was an update for speculative decoding deployment: https://github.com/triton-inference-server/tensorrtllm_backend/commit/130999578896e8cca24aae58fa93fed4c2f22141

Can we try speculative decoding with the tensorrtllm_backend now?

Is there any documentation that can help us? There is no update in that repo's README.

ncomly-nvidia commented 9 months ago

Yep!

We're working on an example with docs now. In the meantime, there is an implementation you can reference here.
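At a high level, the draft-target flow looks like the pseudocode below. This is purely illustrative and not the tensorrtllm_backend API; `call_engine` is a hypothetical stand-in for a Triton client request to a deployed engine:

```python
from typing import List

def call_engine(model_name: str, tokens: List[int], **kwargs) -> List[int]:
    # Hypothetical stand-in for a tritonclient infer call; returns new token IDs.
    raise NotImplementedError

def speculative_generate(prompt: List[int], max_new_tokens: int, k: int = 5) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Small draft engine proposes k candidate tokens cheaply.
        candidates = call_engine("draft_model", tokens, max_tokens=k)
        # 2. Target engine verifies all candidates in a single step and
        #    returns the accepted prefix plus one token of its own.
        tokens += call_engine("target_model", tokens, draft_tokens=candidates)
    return tokens
```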

Alireza3242 commented 3 weeks ago

Assisted generation is implemented in transformers:

https://huggingface.co/blog/gemma-july-update#assisted-generation

We need TensorRT-LLM to add assisted generation.