NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Feature Request] Support for Constrained Decoding (such as generating Json formatted output) #1111

Open · silverriver opened this issue 8 months ago

silverriver commented 8 months ago

Summary

I would like to propose the addition of constrained decoding support. This feature would allow the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free Grammar (CFG), providing more control over the generated sequences for various applications.

The simplest example is the JSON mode provided by the OpenAI API.
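For reference, this is roughly what that looks like with the OpenAI Python client (the model name here is just an example):

```python
# OpenAI "JSON mode": response_format constrains the model to emit valid JSON.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",  # any JSON-mode-capable model
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You reply with a JSON object."},
        {"role": "user", "content": "Give me a user profile with a name and an age."},
    ],
)
print(resp.choices[0].message.content)  # guaranteed to parse as JSON
```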

This feature is already implemented in other repos, such as https://github.com/outlines-dev/outlines and https://github.com/guidance-ai/guidance?tab=readme-ov-file#constrained-generation
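Conceptually, those libraries compile the constraint (a regex, JSON schema, or grammar) into a finite state machine over token ids and then mask the logits at every decoding step. A rough self-contained sketch of the idea (all names below are illustrative stand-ins, not TensorRT-LLM or outlines/guidance APIs):

```python
# Rough sketch of FSM-constrained greedy decoding: at each step, tokens with no
# legal FSM transition from the current state get -inf logits, so the grammar
# can never be violated. All names below are illustrative stand-ins.
import math
from typing import Callable, Dict, List, Set, Tuple

def constrained_greedy_decode(
    next_logits: Callable[[List[int]], List[float]],  # LLM forward pass (stand-in)
    transitions: Dict[Tuple[int, int], int],          # (fsm_state, token_id) -> next fsm_state
    final_states: Set[int],
    start_state: int = 0,
    max_tokens: int = 128,
) -> List[int]:
    state, output = start_state, []
    for _ in range(max_tokens):
        logits = next_logits(output)
        # Mask every token the FSM does not allow from the current state.
        masked = [
            logit if (state, tid) in transitions else -math.inf
            for tid, logit in enumerate(logits)
        ]
        next_id = max(range(len(masked)), key=masked.__getitem__)
        if masked[next_id] == -math.inf:  # no legal continuation left
            break
        output.append(next_id)
        state = transitions[(state, next_id)]
        if state in final_states:  # constraint satisfied, stop
            break
    return output
```

The engine itself would presumably not need to change; what is needed is a hook in the runtime to apply such a mask to the logits each step before sampling.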

I am wondering whether this feature is on the roadmap for TRT-LLM?

nivibilla commented 8 months ago

Faster version implemented in sglang https://lmsys.org/blog/2024-02-05-compressed-fsm/
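The speedup comes from noticing that, with a compressed FSM, long stretches of the output (the fixed keys and punctuation of a JSON schema) are fully determined by the grammar, so they can be appended without running the model at all. Roughly (stand-in names, not SGLang's API):

```python
# Illustration of the "jump-forward" part of the compressed-FSM idea: follow the
# FSM while exactly one token is legal and emit those tokens for free, i.e.
# without any model forward passes. Names are stand-ins, not any library's API.
from typing import Dict, List, Tuple

def jump_forward(
    state: int,
    transitions: Dict[Tuple[int, int], int],  # (fsm_state, token_id) -> next fsm_state
) -> Tuple[List[int], int]:
    forced: List[int] = []
    while True:
        legal = [tid for (s, tid) in transitions if s == state]
        if len(legal) != 1:  # ambiguous (or final) state -> the model must decide
            return forced, state
        forced.append(legal[0])
        state = transitions[(state, legal[0])]
```

Only the positions where the schema actually leaves a choice (the values) need a forward pass.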

silverriver commented 8 months ago

> Faster version implemented in sglang https://lmsys.org/blog/2024-02-05-compressed-fsm/

Yep, the RadixAttention mechanism proposed in that post is also a nice feature to have if we want to constrain the decoded sequence to a given JSON schema.

fedem96 commented 5 months ago

Adding constrained decoding to this library (like SGLang's JSON decoding) would be great, as it would allow more reliable and faster generation. Is there any news about which release might include it?

dhruvmullick commented 4 months ago

We certainly need this functionality. With vLLM already supporting constrained decoding, its absence could be a dealbreaker for some TRT-LLM users. Is this on the roadmap by any chance? (Pinging @ncomly-nvidia in case you know.)

mayani-nv commented 4 months ago

would this sample help?

avianion commented 4 months ago

> would this sample help?

Helpful, but as previously mentioned, TensorRT-LLM inference is done in C++, whereas that library is in Python.

Since the in-flight batcher used by the Triton Inference Server relies on the C++ implementation of TRT-LLM, that example cannot be used as smoothly without falling back to the pure Python inference backend.