NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

fast-forward tokens in logits post processor #2365

Open mmoskal opened 4 days ago

mmoskal commented 4 days ago

I've been working on an OpenAI-compatible REST server that uses TensorRT-LLM but not Triton, similar to openai_server.py but written in Rust and generally production-ready. Its main feature is support for constrained decoding via Guidance, or more specifically the low-level Rust Guidance library (llguidance). This allows enforcing JSON schemas on the output (similar to OpenAI's "Structured Output" feature, though much more general and with no initial latency), as well as arbitrary context-free grammars.

I've been using the batched logits post processor: I wrote a CUDA kernel that applies a boolean mask to the logits (i.e., sets logits to -inf where the mask is false). Because I needed to support a different temperature per token, the kernel also scales the logits. There is an additional hack in there to support temperature==0.0 (greedy sampling). It would be nice if the logits post processor allowed changing sampling parameters.
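For illustration, a minimal sketch of what such a kernel could look like (not the actual kernel; the names, the assumption of one temperature per sequence per step, and the temperature==0.0 workaround of switching to a huge inverse temperature are all illustrative):

#include <math.h>

// Sketch: mask disallowed tokens and apply temperature scaling in one pass.
// Assumed layout: one block per sequence; logits and mask are [num_seqs, vocab_size] row-major.
__global__ void maskAndScaleLogits(
    float* logits,            // [num_seqs, vocab_size], modified in place
    bool const* mask,         // [num_seqs, vocab_size], true = token allowed
    float const* temperature, // [num_seqs], temperature for the current step
    int vocabSize)
{
    int seq = blockIdx.x;
    float temp = temperature[seq];
    // Assumed temperature==0.0 workaround: use a huge inverse temperature so
    // sampling from the scaled distribution becomes effectively greedy.
    float invTemp = (temp == 0.0f) ? 1e6f : 1.0f / temp;
    for (int tok = threadIdx.x; tok < vocabSize; tok += blockDim.x)
    {
        int idx = seq * vocabSize + tok;
        logits[idx] = mask[idx] ? logits[idx] * invTemp : -INFINITY;
    }
}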

However, the main feature that is missing and that would help performance is the ability to append more than one token to the sequence in the generation phase - fast-forward tokens. This can be thought of as "speculation that is always right".

An example where fast-forward tokens are useful is generating data adhering to a certain JSON schema. The logit processor forces {"name":" to be generated, then the model generates John", the processor forces ,\n"age":, the model generates 42, and so on. Another example is chain-of-thought reasoning: after the model has generated a sentence, the controller forces additional instructions, the model generates more text, and so on. When used, fast-forward tokens greatly speed up the generation process.

SGLang implements this, though it was first done in Guidance.

One way to enable this functionality is to extend the logits post-processor with the following callback, which for each sequence is given the sampled token and returns a vector of tokens (typically the vector would contain just the sampled token, but it could be something else, not necessarily starting with the sampled token).

using PostSamplingCallback = std::function<
    std::vector<std::vector<std::vector<TokenIdType>>>(
        std::vector<IdType> const& seq_ids,
        std::vector<std::vector<TokenIdType>> const& sampled_tokens,
        std::vector<std::optional<IdType>> const& user_seq_ids)>;

The callback would take the request ids, the sampled tokens (a vector of length num_requests of vectors of beam-size sampled tokens), and the user request ids. It would return a vector of length num_requests of vectors of beam-size vectors of tokens to append to each beam.

The signature above looks somewhat complicated because of beams and requests; for a single beam and a single request it is essentially:

std::vector<TokenIdType> post_sampling(TokenIdType t)
{
    return {t}; // default: just the sampled token
}
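For a fuller (hypothetical) sketch, here is how a grammar controller could implement the proposed callback. The type aliases and getForcedTokens are placeholders for the real TRT-LLM types and the constrained-decoding logic (e.g. llguidance), not existing APIs:

#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Placeholder aliases standing in for the TRT-LLM request/token id types.
using IdType = std::uint64_t;
using TokenIdType = std::int32_t;

using PostSamplingCallback = std::function<
    std::vector<std::vector<std::vector<TokenIdType>>>(
        std::vector<IdType> const& seq_ids,
        std::vector<std::vector<TokenIdType>> const& sampled_tokens,
        std::vector<std::optional<IdType>> const& user_seq_ids)>;

// Stub for the grammar controller: returns the tokens forced after `sampled`
// for this sequence/beam (empty if nothing is forced).
std::vector<TokenIdType> getForcedTokens(IdType /*seq_id*/, std::size_t /*beam*/, TokenIdType /*sampled*/)
{
    return {}; // the real implementation would advance the grammar and return forced tokens
}

PostSamplingCallback callback = [](std::vector<IdType> const& seq_ids,
                                   std::vector<std::vector<TokenIdType>> const& sampled_tokens,
                                   std::vector<std::optional<IdType>> const& /*user_seq_ids*/)
{
    std::vector<std::vector<std::vector<TokenIdType>>> result(seq_ids.size());
    for (std::size_t req = 0; req < seq_ids.size(); ++req)
    {
        result[req].resize(sampled_tokens[req].size());
        for (std::size_t beam = 0; beam < sampled_tokens[req].size(); ++beam)
        {
            TokenIdType sampled = sampled_tokens[req][beam];
            result[req][beam] = {sampled}; // default: keep the sampled token
            // Fast-forward: append any tokens the grammar forces next.
            for (TokenIdType t : getForcedTokens(seq_ids[req], beam, sampled))
            {
                result[req][beam].push_back(t);
            }
        }
    }
    return result;
};

The intent is that the runtime appends each returned vector to the corresponding beam, so forced tokens do not cost extra decoding steps.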
MartinMarciniszyn commented 4 days ago

Hi @mmoskal, thank you for submitting this suggestion. We are in the process of improving the capabilities for generating structured output in TRT-LLM. We shall take your proposal into consideration. I will get back to you once we have a clearer design. CC @juney-nvidia, @tqchen

mmoskal commented 4 days ago

Thank you @MartinMarciniszyn! If you are interested in integrating the llguidance library directly into TRT-LLM, let me know. The library has a C interface that allows for regex, JSON schema, and Guidance constraints.