NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

integrating support for structured decoding library outlines #2432

Open kumar-devesh opened 1 week ago

kumar-devesh commented 1 week ago

I was exploring structured text decoding libraries for a use case, and in latency and throughput benchmarks I found Outlines to work best, especially after some recent PRs that fixed its throughput issues at larger batch sizes. I would like to integrate Outlines as a dependency and wrap its LogitsProcessor to work with TensorRT-LLM, similar to how vLLM does it.
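For reference, here is a rough sketch of the shape such a wrapper could take. It assumes an Outlines-style guide exposing `get_next_state` / `get_next_instruction`, and the per-request hook is hypothetical rather than the exact TensorRT-LLM logits post-processor signature, which may differ across versions:

```python
from typing import Dict, Optional, Sequence

import torch


class GuidedLogitsProcessor:
    """Tracks one FSM state per request and masks disallowed tokens in place."""

    def __init__(self, guide, vocab_size: int):
        self.guide = guide                      # e.g. an Outlines RegexGuide / CFGGuide
        self.vocab_size = vocab_size
        self.states: Dict[int, int] = {}        # request id -> current FSM state

    def __call__(self, req_id: int, logits: torch.Tensor,
                 last_token: Optional[int]) -> torch.Tensor:
        state = self.states.get(req_id, 0)
        if last_token is not None:
            # Advance the guide with the token that was actually sampled.
            state = self.guide.get_next_state(state, last_token)
            self.states[req_id] = state

        # Token ids allowed by the regex/grammar in the current state.
        allowed: Sequence[int] = self.guide.get_next_instruction(state).tokens

        # Additive mask: 0 for allowed tokens, -inf for everything else.
        mask = torch.full((self.vocab_size,), float("-inf"),
                          device=logits.device, dtype=logits.dtype)
        mask[list(allowed)] = 0.0
        logits.add_(mask)                       # applied in place, stays on the GPU
        return logits
```

Note that building the mask from a Python token list every step is where the CPU and transfer overhead tends to come from, which is what the caching discussed below tries to avoid.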

pathorn commented 1 week ago

I would like to link some related resources:

First and foremost, there was a really good talk about integrating Outlines into TRTLLM at the LLMs Night two months ago: https://youtu.be/I-a8sTMUq5o?si=J6at-bDJX-knHNra&t=3111 (The other talks were also worth watching)

I will also link some of the MIT-licensed code we developed at DeepInfra as a batched logits post-processor, in case you're interested: https://github.com/deepinfra/tensorrtllm_backend/tree/structured-json. I did not implement the idea from the talk exactly, since my goal was to avoid using Outlines because of CPU bottlenecks (I was not satisfied with the existing caching), but the end result is fairly similar.

In all of these approaches, the main goal was to cache the mask tensors on the GPU and apply them in a CUDA kernel, avoiding synchronization and heavy host-device transfer bottlenecks; a sketch of the idea follows.
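For illustration only, a minimal sketch of that caching idea, assuming one additive mask per FSM state kept resident on the device (the `CachedMaskBank` name and interface are hypothetical, not part of the linked code):

```python
from typing import Dict, Iterable, Sequence

import torch


class CachedMaskBank:
    """Caches one additive logits mask per FSM state on the GPU."""

    def __init__(self, vocab_size: int, device: str = "cuda",
                 dtype: torch.dtype = torch.float16):
        self.vocab_size = vocab_size
        self.device = device
        self.dtype = dtype
        self.masks: Dict[int, torch.Tensor] = {}   # FSM state -> additive mask

    def mask_for(self, state: int, allowed_tokens: Iterable[int]) -> torch.Tensor:
        mask = self.masks.get(state)
        if mask is None:
            mask = torch.full((self.vocab_size,), float("-inf"),
                              device=self.device, dtype=self.dtype)
            mask[list(allowed_tokens)] = 0.0       # one-time host->device cost per state
            self.masks[state] = mask
        return mask

    def apply(self, logits: torch.Tensor, states: Sequence[int],
              allowed_per_state: Dict[int, Iterable[int]]) -> None:
        # Batched in-place masking: one add per request row, all on the GPU,
        # with no per-step transfer once a state's mask has been built.
        for row, state in enumerate(states):
            logits[row].add_(self.mask_for(state, allowed_per_state[state]))
```

The in-place add here stands in for the fused CUDA masking kernel; the key point is that once a state's mask exists on the device, decoding steps reuse it without any host round-trip.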

There is still future work needed to implement schema validation.