Open silverriver opened 9 months ago
Faster version implemented in SGLang: https://lmsys.org/blog/2024-02-05-compressed-fsm/
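For context, the speedup there comes from "jump-forward" decoding: whenever the compressed FSM leaves exactly one legal next token (long JSON key names, fixed punctuation, etc.), that token can be emitted directly without a model forward pass. Here is a minimal sketch of that idea, using a toy hypothetical FSM representation rather than SGLang's actual data structures:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy FSM over token IDs: state -> (allowed token -> next state).
// Any token absent from a state's map is forbidden in that state.
struct Fsm {
    std::vector<std::unordered_map<int32_t, int32_t>> transitions;
    int32_t state = 0;
};

// Append every forced token (a state with exactly one legal transition)
// to `output` without calling the model; return how many were appended.
std::size_t jumpForward(Fsm& fsm, std::vector<int32_t>& output) {
    std::size_t added = 0;
    while (fsm.transitions[fsm.state].size() == 1) {
        const auto& [token, next] = *fsm.transitions[fsm.state].begin();
        output.push_back(token);
        fsm.state = next;
        ++added;
    }
    return added;
}
```

Between jumps, the runtime still runs the model and masks logits step by step as usual; the jump only skips the steps where the grammar already determines the output.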
Yep, the RadixAttention mechanism proposed in this paper is also a nice feature to have if we want to constrain the decoded sequence to a given JSON schema.
Adding constrained decoding to this library (like SGLang's JSON decoding) would be great, as it would allow more reliable and faster generation. Is there any news about which release might include it?
We certainly need this functionality. With vLLM already supporting constrained decoding, its absence could be a dealbreaker for some TRT-LLM users. Is this on the roadmap by any chance (pinging @ncomly-nvidia in case you know)?
Would this sample help?
Helpful, but as previously mentioned, TensorRT-LLM inference is done in C++, whereas that library is in Python.
Since the in-flight batcher used by the Triton Inference Server relies on the C++ implementation of TRT-LLM, that example cannot be used as smoothly without switching to the pure Python inference backend.
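To make the gap concrete: what the C++ runtime would need to expose is essentially a per-step logits callback that runs between the forward pass and sampling. Below is a minimal sketch of such a hook; every name here is hypothetical and not part of the TRT-LLM API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <unordered_set>
#include <utility>
#include <vector>

// Hook type: given the request's tokens so far, edit the logits in place.
using LogitsHook = std::function<void(const std::vector<int32_t>& tokens,
                                      std::vector<float>& logits)>;

// Build a hook that drives every token outside `allowed` to -inf, so the
// sampler can never pick it. A grammar engine would recompute `allowed`
// from its current FSM/CFG state before each decoding step.
LogitsHook makeMaskHook(std::unordered_set<int32_t> allowed) {
    return [allowed = std::move(allowed)](const std::vector<int32_t>&,
                                          std::vector<float>& logits) {
        for (std::size_t tok = 0; tok < logits.size(); ++tok) {
            if (allowed.count(static_cast<int32_t>(tok)) == 0) {
                logits[tok] = -std::numeric_limits<float>::infinity();
            }
        }
    };
}
```

The hook itself stays trivial; the hard part is wiring it into the batched C++ decode loop so each in-flight request can carry its own grammar state.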
cc @AdamzNV @ncomly-nvidia @laikhtewari for vis.
https://github.com/guidance-ai/llgtrt might be of interest. It is a native (Rust, though) OpenAI-compatible REST server incorporating the llguidance Rust library for constrained decoding.
Summary

I would like to propose the addition of constrained decoding support. This feature would allow the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free Grammar (CFG), providing more control over the generated sequences for various applications.
The simplest example is the JSON mode provided by the OpenAI API.
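Under the hood, JSON mode is typically implemented by compiling the schema to a regex/FSM over the tokenizer's vocabulary and masking the logits at every step (roughly what outlines does). Here is a minimal sketch of a single constrained greedy step, with a toy hand-written FSM standing in for a compiled one:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <unordered_map>
#include <vector>

// Toy FSM compiled from a grammar/schema: state -> (token -> next state).
struct Fsm {
    std::vector<std::unordered_map<int32_t, int32_t>> transitions;
    int32_t state = 0;
};

// One constrained greedy step: forbid every token without a legal
// transition, pick the best survivor, and advance the FSM. Assumes the
// grammar always leaves at least one token legal in the current state.
int32_t constrainedStep(Fsm& fsm, std::vector<float> logits) {
    const auto& allowed = fsm.transitions[fsm.state];
    for (std::size_t tok = 0; tok < logits.size(); ++tok) {
        if (allowed.find(static_cast<int32_t>(tok)) == allowed.end()) {
            logits[tok] = -std::numeric_limits<float>::infinity();
        }
    }
    const auto best = static_cast<int32_t>(std::distance(
        logits.begin(), std::max_element(logits.begin(), logits.end())));
    fsm.state = allowed.at(best);
    return best;
}
```

In a real implementation the FSM is derived automatically from the schema and a proper sampler replaces the argmax, but this per-step masking is the whole mechanism.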
This feature is already implemented in other repos, such as https://github.com/outlines-dev/outlines and https://github.com/guidance-ai/guidance?tab=readme-ov-file#constrained-generation
I am wondering whether this feature is on the TRT-LLM roadmap?