NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Question] Document/examples to enable draft model speculative decoding using c++ executor API #2424

Open ynwang007 opened 2 weeks ago

ynwang007 commented 2 weeks ago

Hi,

I am interested in using a draft model for speculative decoding, and the only example I found is: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model

We use TensorRT-LLM (C++ runtime) through the Python executor bindings: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp. Can anyone provide instructions on how to support draft model speculative decoding on top of that?

If I understand it correctly, we have to implement the logic ourselves to generate draft tokens at each iteration and then pass them to the target model executor (roughly as in the sketch below)? Is there an executor API that does this work for us? Thanks!
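To be concrete, here is roughly what I imagine for the draft step of each iteration (just a sketch against the Python executor bindings, not working code; the names below, e.g. `max_tokens` and `exclude_input_from_output`, are my guesses and may differ between versions):

```python
# Sketch only: generate a handful of draft tokens with a separate draft-model executor.
import tensorrt_llm.bindings.executor as trtllm

draft_executor = trtllm.Executor(
    "/path/to/draft_engine", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig())

tokens = [1, 2, 3, 4]   # prompt ids + tokens accepted so far (placeholder values)
num_draft_tokens = 5    # assumed draft length

draft_request = trtllm.Request(
    input_token_ids=tokens,
    max_tokens=num_draft_tokens,
    sampling_config=trtllm.SamplingConfig(beam_width=1),
    output_config=trtllm.OutputConfig(exclude_input_from_output=True))
request_id = draft_executor.enqueue_request(draft_request)
# Beam 0 of the draft result holds the speculative tokens for this iteration.
draft_tokens = draft_executor.await_responses(request_id)[0].result.output_token_ids[0]
```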

achartier commented 2 weeks ago

That's correct. You can find an example using ExternalDraftTokensConfig in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L628
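Continuing the sketch from your question, the draft tokens would be attached to the target request via ExternalDraftTokensConfig (again only a sketch, not an official example; keyword names such as `external_draft_tokens_config` may differ between versions, and the target engine must be built for the draft/target setup as in the draft_target_model example):

```python
# Sketch only: verify the draft tokens with the target-model executor.
target_executor = trtllm.Executor(
    "/path/to/target_engine", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig())

target_request = trtllm.Request(
    input_token_ids=tokens,
    max_tokens=num_draft_tokens + 1,  # a verify step can accept up to draft length + 1 tokens
    sampling_config=trtllm.SamplingConfig(beam_width=1),
    output_config=trtllm.OutputConfig(exclude_input_from_output=True),
    external_draft_tokens_config=trtllm.ExternalDraftTokensConfig(draft_tokens))
request_id = target_executor.enqueue_request(target_request)
accepted = target_executor.await_responses(request_id)[0].result.output_token_ids[0]
tokens += accepted  # append the accepted tokens and repeat the draft/verify loop
```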

An example using the C++ executor API will be provided in the next update.

achartier commented 1 week ago

See https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/cpp/executor/executorExampleFastLogits.cpp for the example using the C++ executor API.