ynwang007 opened 2 weeks ago
Hi,

I am interested in using a draft model for speculative decoding, and the only example I found is https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model.

We use TensorRT-LLM (C++ runtime) through the Python executor interface: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp. Can anyone provide instructions on how to support draft-model speculative decoding on top of that?

If I understand correctly, we have to implement the logic that generates draft tokens on each iteration ourselves and then pass them to the target model executor. Is there an executor API that does this work for us?
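To make the question concrete, the loop below is roughly what I imagine we would have to write ourselves. It is only a sketch: I am guessing the `Request` and `ExternalDraftTokensConfig` fields from bindings.cpp (`max_tokens` may be `max_new_tokens` in older releases), and the engine paths, prompt ids, and EOS id are placeholders.

```python
from tensorrt_llm.bindings import executor as trtllm

# Placeholder engine directories; both engines must share a tokenizer.
draft = trtllm.Executor("draft_engine_dir", trtllm.ModelType.DECODER_ONLY,
                        trtllm.ExecutorConfig())
target = trtllm.Executor("target_engine_dir", trtllm.ModelType.DECODER_ONLY,
                         trtllm.ExecutorConfig())

tokens = [1, 100, 200]  # placeholder: tokenized prompt
NUM_DRAFT = 4           # draft tokens proposed per iteration
MAX_LEN, EOS_ID = 128, 2

while len(tokens) < MAX_LEN:
    # 1) Cheap pass: ask the draft model for a few candidate tokens.
    rid = draft.enqueue_request(
        trtllm.Request(input_token_ids=tokens, max_tokens=NUM_DRAFT))
    out = draft.await_responses(rid)[0].result.output_token_ids[0]
    draft_ids = out[len(tokens):]  # assuming the output echoes the prompt

    # 2) Expensive pass: the target model verifies all draft tokens at once.
    rid = target.enqueue_request(trtllm.Request(
        input_token_ids=tokens,
        max_tokens=NUM_DRAFT + 1,
        external_draft_tokens_config=trtllm.ExternalDraftTokensConfig(draft_ids)))
    out = target.await_responses(rid)[0].result.output_token_ids[0]
    accepted = out[len(tokens):]   # accepted draft prefix + one corrected token
    tokens.extend(accepted)
    if EOS_ID in accepted:
        break
```

Thanks!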
That's correct. You can find an example using ExternalDraftTokensConfig in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L628.

An example using the C++ executor API will be provided in the next update.
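For reference, the relevant part of the linked runner boils down to attaching the draft tokens (and optionally their logits) to the target-model request. The snippet below is a paraphrase from my reading of the bindings, so keyword names such as `acceptance_threshold` may differ between releases:

```python
from tensorrt_llm.bindings import executor as trtllm

def make_target_request(prompt_ids, draft_ids, draft_logits=None):
    # Draft tokens ride along with the target request; the executor
    # verifies them in a single target-model pass.
    draft_cfg = trtllm.ExternalDraftTokensConfig(
        draft_ids,                 # tokens proposed by the draft model
        draft_logits,              # optional: enables logit-based acceptance
        acceptance_threshold=0.5,  # optional knob; keyword name is my assumption
    )
    return trtllm.Request(
        input_token_ids=prompt_ids,
        max_tokens=len(draft_ids) + 1,
        external_draft_tokens_config=draft_cfg,
    )
```

If I read the code right, supplying the draft logits switches acceptance from exact token matching to a probability-based test, which usually accepts more draft tokens per step.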