NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Question] Document/examples to enable draft model speculative decoding using c++ executor API #2424

Open ynwang007 opened 2 weeks ago

ynwang007 commented 2 weeks ago

Hi,

I am interested in using a draft model for speculative decoding, and the only example I found is: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model

We use TensorRT-LLM (C++ runtime) through the Python executor bindings: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp. Can anyone provide instructions on how to support draft model speculative decoding on top of that?

If I understand it correctly, we have to implement the logic ourselves to generate draft tokens at each iteration and then pass them to the target model executor (roughly as in the sketch below)? Is there an executor API that does this work for us? Thanks!
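To be concrete, here is roughly what I imagine for the draft step of each iteration (just a sketch against the Python executor bindings, not working code; the names below, e.g. `max_tokens` and `exclude_input_from_output`, are my guesses and may differ between versions):

```python
# Sketch only: generate a handful of draft tokens with a separate draft-model executor.
import tensorrt_llm.bindings.executor as trtllm

draft_executor = trtllm.Executor(
    "/path/to/draft_engine", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig())

tokens = [1, 2, 3, 4]   # prompt ids + tokens accepted so far (placeholder values)
num_draft_tokens = 5    # assumed draft length

draft_request = trtllm.Request(
    input_token_ids=tokens,
    max_tokens=num_draft_tokens,
    sampling_config=trtllm.SamplingConfig(beam_width=1),
    output_config=trtllm.OutputConfig(exclude_input_from_output=True))
request_id = draft_executor.enqueue_request(draft_request)
# Beam 0 of the draft result holds the speculative tokens for this iteration.
draft_tokens = draft_executor.await_responses(request_id)[0].result.output_token_ids[0]
```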

achartier commented 2 weeks ago

That's correct. You can find an example using ExternalDraftTokensConfig in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L628
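Continuing the sketch from your question, the draft tokens would be attached to the target request via ExternalDraftTokensConfig (again only a sketch, not an official example; keyword names such as `external_draft_tokens_config` may differ between versions, and the target engine must be built for the draft/target setup as in the draft_target_model example):

```python
# Sketch only: verify the draft tokens with the target-model executor.
target_executor = trtllm.Executor(
    "/path/to/target_engine", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig())

target_request = trtllm.Request(
    input_token_ids=tokens,
    max_tokens=num_draft_tokens + 1,  # a verify step can accept up to draft length + 1 tokens
    sampling_config=trtllm.SamplingConfig(beam_width=1),
    output_config=trtllm.OutputConfig(exclude_input_from_output=True),
    external_draft_tokens_config=trtllm.ExternalDraftTokensConfig(draft_tokens))
request_id = target_executor.enqueue_request(target_request)
accepted = target_executor.await_responses(request_id)[0].result.output_token_ids[0]
tokens += accepted  # append the accepted tokens and repeat the draft/verify loop
```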

An example using the C++ executor API will be provided in the next update.

achartier commented 1 week ago

See https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/cpp/executor/executorExampleFastLogits.cpp for the example using the C++ executor API.