NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

integrating support for structured decoding library outlines #2432

Open kumar-devesh opened 1 week ago

kumar-devesh commented 1 week ago

I was exploring structured text decoding libraries for a use case, and in latency and throughput benchmarks I found Outlines to work best, especially after some recent PRs that fixed its throughput issues at larger batch sizes. I would like to integrate Outlines as a dependency and wrap its LogitsProcessor to work with TensorRT-LLM, similar to how vLLM does it.
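For reference, here is a rough sketch of the shape such a wrapper could take. It assumes an Outlines-style guide exposing `get_next_state` / `get_next_instruction`, and the per-request hook is hypothetical rather than the exact TensorRT-LLM logits post-processor signature, which may differ across versions:

```python
from typing import Dict, Optional, Sequence

import torch


class GuidedLogitsProcessor:
    """Tracks one FSM state per request and masks disallowed tokens in place."""

    def __init__(self, guide, vocab_size: int):
        self.guide = guide                      # e.g. an Outlines RegexGuide / CFGGuide
        self.vocab_size = vocab_size
        self.states: Dict[int, int] = {}        # request id -> current FSM state

    def __call__(self, req_id: int, logits: torch.Tensor,
                 last_token: Optional[int]) -> torch.Tensor:
        state = self.states.get(req_id, 0)
        if last_token is not None:
            # Advance the guide with the token that was actually sampled.
            state = self.guide.get_next_state(state, last_token)
            self.states[req_id] = state

        # Token ids allowed by the regex/grammar in the current state.
        allowed: Sequence[int] = self.guide.get_next_instruction(state).tokens

        # Additive mask: 0 for allowed tokens, -inf for everything else.
        mask = torch.full((self.vocab_size,), float("-inf"),
                          device=logits.device, dtype=logits.dtype)
        mask[list(allowed)] = 0.0
        logits.add_(mask)                       # applied in place, stays on the GPU
        return logits
```

Note that building the mask from a Python token list every step is where the CPU and transfer overhead tends to come from, which is what the caching discussed below tries to avoid.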

pathorn commented 1 week ago

I would like to link some related resources:

First and foremost, there was a really good talk about integrating Outlines into TRTLLM at the LLMs Night two months ago: https://youtu.be/I-a8sTMUq5o?si=J6at-bDJX-knHNra&t=3111 (The other talks were also worth watching)

I will also link some of the MIT-licensed code we developed at DeepInfra as a batched logits post-processor, in case you're interested: https://github.com/deepinfra/tensorrtllm_backend/tree/structured-json. I did not implement the idea from the talk exactly, since my goal was to avoid using Outlines because of CPU bottlenecks (I was not satisfied with the existing caching), but the end result is fairly similar.

In all of these approaches, the main goal was to cache the mask tensors on the GPU and apply them in a CUDA kernel, avoiding synchronization and heavy host-device transfer bottlenecks; a sketch of the idea follows.
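For illustration only, a minimal sketch of that caching idea, assuming one additive mask per FSM state kept resident on the device (the `CachedMaskBank` name and interface are hypothetical, not part of the linked code):

```python
from typing import Dict, Iterable, Sequence

import torch


class CachedMaskBank:
    """Caches one additive logits mask per FSM state on the GPU."""

    def __init__(self, vocab_size: int, device: str = "cuda",
                 dtype: torch.dtype = torch.float16):
        self.vocab_size = vocab_size
        self.device = device
        self.dtype = dtype
        self.masks: Dict[int, torch.Tensor] = {}   # FSM state -> additive mask

    def mask_for(self, state: int, allowed_tokens: Iterable[int]) -> torch.Tensor:
        mask = self.masks.get(state)
        if mask is None:
            mask = torch.full((self.vocab_size,), float("-inf"),
                              device=self.device, dtype=self.dtype)
            mask[list(allowed_tokens)] = 0.0       # one-time host->device cost per state
            self.masks[state] = mask
        return mask

    def apply(self, logits: torch.Tensor, states: Sequence[int],
              allowed_per_state: Dict[int, Iterable[int]]) -> None:
        # Batched in-place masking: one add per request row, all on the GPU,
        # with no per-step transfer once a state's mask has been built.
        for row, state in enumerate(states):
            logits[row].add_(self.mask_for(state, allowed_per_state[state]))
```

The in-place add here stands in for the fused CUDA masking kernel; the key point is that once a state's mask exists on the device, decoding steps reuse it without any host round-trip.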

There is still future work needed to implement schema validation.