SupreethRao99 opened 9 months ago
Yes, waiting for this as well.
TensorRT-LLM supports logits processors so it should be possible to integrate.
How are you hoping to use it? Are you seeking `outlines.models.tensorrt`, a serve endpoint, or both?
I'm looking at something similar to `outlines.models.tensorrt` right now, as my use case is mostly offline batched inference. Could you give me a starting point for how I can build this out? I'm eager to contribute and add such a feature.
@SupreethRao99 glad to hear you're interested in contributing.
I think a good starting point is looking into how TensorRT performs generation and handles `LogitsProcessor`s.
Then I'd review how `llamacpp` is being implemented; it shares similarities with how TensorRT would work: https://github.com/outlines-dev/outlines/pull/556/

Specifically `llamacpp.py`: https://github.com/dtiarks/outlines/blob/726ec242fb1695c5a67d489689be13ac84ef472c/outlines/models/llamacpp.py
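As a rough, hedged outline of the shape that integration follows: the backend tokenizer is wrapped so Outlines can compile its FSM over the vocabulary, and generation runs token by token with the processor applied at each step. The names below (`TRTLLMTokenizerAdapter`, `engine.decode_step`) are placeholders for illustration, not existing APIs.

```python
# Hypothetical adapter shape; `engine` and its `decode_step` method are
# placeholders that a real integration would map onto TensorRT-LLM's runtime.
from typing import List


class TRTLLMTokenizerAdapter:
    """Expose the pieces of the backend tokenizer that Outlines' FSM needs."""

    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer
        self.vocabulary = hf_tokenizer.get_vocab()
        self.eos_token_id = hf_tokenizer.eos_token_id

    def convert_token_to_string(self, token: str) -> str:
        # Needed so the FSM can be compiled over the detokenized vocabulary.
        return self.tokenizer.convert_tokens_to_string([token])


def generate(engine, tokenizer: TRTLLMTokenizerAdapter, processor, prompt: str,
             max_tokens: int = 256) -> str:
    """Greedy token-by-token loop that applies a logits processor each step."""
    prompt_ids: List[int] = tokenizer.tokenizer.encode(prompt)
    generated: List[int] = []
    for _ in range(max_tokens):
        logits = engine.decode_step(prompt_ids + generated)  # placeholder call
        logits = processor(generated, logits)
        next_id = int(logits.argmax())
        if next_id == tokenizer.eos_token_id:
            break
        generated.append(next_id)
    return tokenizer.tokenizer.decode(generated)
```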
Please let me know if you have any questions!
Thank you for the resources! I'll definitely get back to you with questions after going through these links.
Thanks!
Related to #655
This can likely be implemented with the Executor API: https://github.com/NVIDIA/TensorRT-LLM/blob/31ac30e928a2db795799fdcab6be446bfa3a3998/examples/cpp/executor/README.md#L4
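A very rough sketch of what that path might look like from Python. The module and member names below are assumptions modeled on the linked C++ executor example and may not match the Python bindings in a given TensorRT-LLM release; verify them against your installed version.

```python
# Assumed names modeled on the linked C++ executor example; check them against
# the Python bindings shipped with your TensorRT-LLM version.
import tensorrt_llm.bindings.executor as trtllm  # assumed module path

executor = trtllm.Executor(
    "/path/to/engine_dir",              # compiled TensorRT-LLM engine
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(),
)

# Enqueue one request; a structured-generation integration would attach the
# logits-processor hook here, if/when the executor exposes one to Python.
request = trtllm.Request(input_token_ids=[1, 2, 3], max_new_tokens=64)
request_id = executor.enqueue_request(request)

# Poll until the request finishes and collect the generated token ids.
output_ids = []
done = False
while not done:
    for response in executor.await_responses(request_id):
        result = response.result
        output_ids.extend(result.output_token_ids[0])  # beam 0
        done = result.is_final
```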
Outlines currently supports the vLLM inference engine; it would be great if it could also support the TensorRT-LLM inference engine.