mmoskal opened 4 days ago
Hi @mmoskal, thank you for submitting this suggestion. We are in the process of improving the capabilities for generating structured output in TRT-LLM. We shall take your proposal into consideration. I will get back to you once we have a clearer design. CC @juney-nvidia , @tqchen
Thank you @MartinMarciniszyn ! If you are interested in integrating the llguidance library directly into TRT-LLM let me know. The library has a C interface that allows for regex, JSON schema, and Guidance constraints.
I've been working on an OpenAI-compatible REST server, utilizing TensorRT-LLM but not Triton, similar to `openai_server.py` but in Rust and generally production-ready. Its main feature is support for constrained decoding using Guidance, or more specifically the low-level Rust Guidance library. This allows enforcing JSON schemas on the output (similar to OpenAI's "Structured Output" feature, though much more general and with no initial latency), as well as arbitrary context-free grammars.

I've been utilizing the batched logits post-processor: I wrote a CUDA kernel that applies a boolean mask to the logits (i.e., sets logits to -inf where the mask is false). Because I needed to support a different temperature per token, the kernel also scales the logits. There is an additional hack in there to support temperature == 0.0 (greedy sampling). It would be nice if the logits post-processor allowed changing the sampling parameters.
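To make the kernel's behavior concrete, here is the gist of that masking-plus-temperature logic sketched in NumPy rather than CUDA; the function name and the exact form of the greedy hack are illustrative, not the actual kernel:

```python
import numpy as np

def mask_and_scale(logits: np.ndarray, allowed: np.ndarray, temperature: float) -> np.ndarray:
    """Apply a boolean token mask and a per-request temperature to logits.

    logits:  (vocab_size,) float scores from the model
    allowed: (vocab_size,) bool, True where the grammar permits the token
    temperature: sampling temperature; 0.0 is treated as greedy
    """
    out = np.where(allowed, logits, -np.inf)
    if temperature == 0.0:
        # Greedy "hack": leave only the argmax token finite so that any
        # downstream sampler is forced to pick it deterministically.
        greedy = np.full_like(out, -np.inf)
        greedy[np.argmax(out)] = 0.0
        return greedy
    return out / temperature
```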
However, the main missing feature that would help performance is the ability to append more than one token to the sequence in the output phase - fast-forward tokens. These can be thought of as "speculation that is always right".
An example where fast-forward tokens are useful is generating data that adheres to a certain JSON schema. The logit processor forces `{"name":"` to be generated, then the model generates `John"`, the processor forces `,\n"age":`, the model generates `42`, and so on. Another example is chain-of-thought reasoning, where after the model has generated a sentence, the controller forces more instructions for the model, the model generates more text, and so on. When applicable, fast-forward tokens greatly speed up the generation process. sglang implements this, though it was first done in Guidance.
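As a toy illustration of that interleaving (this is not a TRT-LLM API; `sample_fn` stands in for the model):

```python
def generate_with_fast_forward(forced_segments, sample_fn):
    """Alternate grammar-forced text (appended in one step, with no
    per-token decoding) with model-sampled text. Toy sketch only."""
    pieces = []
    for forced in forced_segments:
        pieces.append(forced)             # fast-forward: forced in one shot
        pieces.append(sample_fn(pieces))  # model fills in the free-form part
    return "".join(pieces)

# The JSON-schema example from above:
model_outputs = iter(['John"', "42"])
result = generate_with_fast_forward(
    ['{"name":"', ',\n"age":'],
    lambda _ctx: next(model_outputs),
)
```

Here `result` interleaves the forced and sampled pieces into `{"name":"John",\n"age":42`.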
One way to enable this functionality is to extend the logit post-processor with a callback that, for each sequence, is given the sampled token and returns a vector of tokens (typically just the sampled token, but it could be something else, not necessarily starting with the sampled token).

The callback would take the request IDs, the sampled tokens (a vector of `num_requests` vectors of `beam_size` sampled tokens), and the user request IDs. It would return a vector of `num_requests` vectors of `beam_size` vectors of tokens to append to each beam.
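A hedged Python sketch of what such a batched callback might look like (all names and types here are hypothetical, not an existing TRT-LLM API):

```python
from typing import List

Token = int

def fast_forward_post_processor(
    req_ids: List[int],          # engine-assigned request ids
    sampled: List[List[Token]],  # [num_requests][beam_size] sampled tokens
    user_req_ids: List[int],     # caller-side request ids
) -> List[List[List[Token]]]:
    """For each request and each beam, return the tokens to append to that
    beam (need not start with, or even contain, the sampled token)."""
    # Default behaviour: plain decoding, append exactly the sampled token.
    return [[[tok] for tok in beams] for beams in sampled]
```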
For example: the model samples `John`, and the post-processor returns {`John`}; the model samples `"`, and the post-processor returns {`",`, `\n`, `"`, `age`, `":`} (assuming such tokens exist - note how `"` was replaced by `",`).

The signature above looks somewhat complicated due to beams and requests; for a single beam and single request it's essentially:
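For instance, stripped of batching and beams, a minimal Python sketch of that single-request form (with made-up token IDs mirroring the `"` example above) would be:

```python
from typing import List

# Hypothetical token ids for the pieces `"`, `",`, `\n`, `age`, `":`
QUOTE, QUOTE_COMMA, NL, AGE, QUOTE_COLON = 10, 11, 12, 13, 14

def fast_forward(sampled: int) -> List[int]:
    """Single-request, single-beam form: map the sampled token to the list
    of tokens actually appended (not necessarily starting with it)."""
    if sampled == QUOTE:
        # Replace the bare `"` with `",` and force `\n"age":` behind it.
        return [QUOTE_COMMA, NL, QUOTE, AGE, QUOTE_COLON]
    return [sampled]  # ordinary decoding: append the sampled token as-is
```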