kumar-devesh opened 1 week ago
I would like to link some related resources:
First and foremost, there was a really good talk about integrating Outlines into TRTLLM at the LLMs Night two months ago: https://youtu.be/I-a8sTMUq5o?si=J6at-bDJX-knHNra&t=3111 (The other talks were also worth watching)
I'll also link some of the MIT-licensed code we developed at DeepInfra as a batched logits post-processor, in case you're interested: https://github.com/deepinfra/tensorrtllm_backend/tree/structured-json I didn't implement the idea from the talk exactly, since my goal was to avoid using Outlines because of CPU bottlenecks (I wasn't satisfied with its existing caching), but the end result is fairly similar.
In all of these approaches, the main goal was to cache the mask tensors on the GPU and apply them in a CUDA kernel, avoiding synchronization and heavy host-to-device transfer bottlenecks.
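To make the caching idea concrete, here is a minimal sketch of a per-state mask cache. It uses NumPy on the CPU so it stays self-contained; in the actual setup the masks would live in GPU memory and the masking step would run inside a CUDA kernel. The `allowed_tokens_for_state` callback is a hypothetical stand-in for whatever the FSM/guide exposes.

```python
import numpy as np

NEG_INF = float("-inf")


class MaskCache:
    """Cache one boolean allow-mask per FSM state so each mask is built once.

    Sketch only: NumPy stands in for GPU tensors, and the in-place masking
    below stands in for a CUDA masking kernel.
    """

    def __init__(self, vocab_size, allowed_tokens_for_state):
        self.vocab_size = vocab_size
        # Hypothetical callback: FSM state -> iterable of allowed token ids.
        self.allowed = allowed_tokens_for_state
        self._masks = {}

    def mask_for(self, state):
        # Build the mask lazily on first use, then reuse it for every
        # request that reaches the same FSM state.
        if state not in self._masks:
            m = np.zeros(self.vocab_size, dtype=bool)
            m[list(self.allowed(state))] = True
            self._masks[state] = m
        return self._masks[state]

    def apply(self, logits, state):
        # Disallowed tokens get -inf, so softmax assigns them probability 0.
        logits[~self.mask_for(state)] = NEG_INF
        return logits
```

Because the mask depends only on the FSM state (not on the request), the cache amortizes mask construction across the whole batch, which is where the CPU savings come from.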
There is still future work needed to implement schema validation.
I was exploring structured text decoding libraries for a use case, and in latency and throughput benchmarks I found Outlines to work best, with some recent PRs fixing its throughput issues at larger batch sizes. I would like to integrate Outlines as a dependency and use its `LogitsProcessor`, wrapping it to work with TensorRT-LLM the way vLLM does.
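A rough sketch of what that wrapper could look like, under stated assumptions: `get_next_instruction(state).tokens` and `get_next_state(state, token)` loosely mirror Outlines' guide interface, and the per-request callback shape (request id, logits, last sampled token) is only a simplified stand-in for TensorRT-LLM's logits post-processor hook, not its actual signature.

```python
import math
from typing import Dict, List, Optional


class _Instr:
    """Minimal stand-in for an Outlines 'Generate' instruction."""

    def __init__(self, tokens):
        self.tokens = tokens


class ToyGuide:
    """Toy guide: state 0 allows token 1, state 1 allows token 2."""

    def get_next_instruction(self, state):
        return _Instr([1] if state == 0 else [2])

    def get_next_state(self, state, token):
        return state + 1


class GuidedProcessor:
    """Wrap an Outlines-style guide as a per-request logits processor."""

    def __init__(self, guide):
        self.guide = guide
        self.states: Dict[int, int] = {}  # request id -> FSM state

    def __call__(self, req_id: int, logits: List[float],
                 last_token: Optional[int]) -> List[float]:
        state = self.states.get(req_id, 0)
        if last_token is not None:
            # Advance this request's FSM with the token just sampled.
            state = self.guide.get_next_state(state, last_token)
            self.states[req_id] = state
        allowed = set(self.guide.get_next_instruction(state).tokens)
        # Mask everything outside the allowed set.
        for tok in range(len(logits)):
            if tok not in allowed:
                logits[tok] = -math.inf
        return logits
```

The key point the sketch shows is the bookkeeping: the wrapper has to track one FSM state per in-flight request, since the engine interleaves requests in a batch between callback invocations.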