NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Feature request] Add LogitsProcessor class support in C++ Executor API #1680

Open chiendb97 opened 6 months ago

chiendb97 commented 6 months ago

Hi team, I would like to use the LogitsPostProcessor in the C++ Executor API to control the generation of language models. However, unlike frameworks like Hugging Face, vLLM, or the implementation in Model Runner, which support class-based approaches, this feature currently only supports functions. This limitation makes implementation challenging. Could the TensorRT-LLM team consider adding support for this feature in TensorRT-LLM? Thank you.

AdamzNV commented 5 months ago

@chiendb97 Could you please provide more details about why you need the LogitsProcessor and how you plan to use it in your application? We'll determine its priority based on your application. Many thanks.

DreamGenX commented 5 months ago

@AdamzNV The C++ executor is very cumbersome in certain cases:

When you want to have a parametrized processor:

The simplest example would be something like "for this request, set the logit of token X to -inf", or "for this request, increase the logit of token X by Y". Right now, the logits processor would need to somehow maintain and access per-request state through the request id it is provided.

EDIT: It looks like this is actually not possible at all -- at the time we create the request, we don't know its id, so there does not seem to be a way for us to maintain the state for the processor.

When you want to dynamically combine multiple processors:

This is a special case of the above, but it's common that you may want to apply processors X and Y for one request, and processors X and Z for another.

Possible solution:

A nicer API would allow passing a std::function (or an object of some LogitsProcessor class) per request. The function can be created on the fly for each request and can encapsulate the necessary state. A single function is in theory enough, as it can internally dispatch to multiple processors as needed.

Let me know if I am missing something obvious.

chiendb97 commented 5 months ago

> @chiendb97 Could you please provide more details about why you need the LogitsProcessor and how you plan to use it in your application? We'll determine its priority based on your application. Many thanks.

@AdamzNV I use LogitsProcessor to control the generation of language models, such as generating JSON formatted output or regex formatted output. Additionally, we have a poetry generation application that requires the model's output to adhere to specific poetic rules.

This feature would enable the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free Grammar (CFG), as mentioned in #1111. With function support, it seems possible to use static variables or lambda functions. However, I believe that using classes would make the implementation easier and clearer.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

nv-guomingz commented 1 week ago

Hi @chiendb97, do you still have any further issues or questions? If not, we'll close this soon.