SupreethRao99 opened 9 months ago
Yes, waiting for this as well.
TensorRT-LLM supports logits processors so it should be possible to integrate.
How are you hoping to use it? Are you seeking `outlines.models.tensorrt`, a serve endpoint, or both?
I'm looking at something similar to `outlines.models.tensorrt` right now, as my use case is mostly offline batched inference. Could you give me a starting point for how I can build this out? I'm eager to contribute and add such a feature.
@SupreethRao99 glad to hear you're interested in contributing.
I think a good starting point is looking into how TensorRT performs generation and handles `LogitsProcessor`s.
Then I'd review how `llamacpp` is being implemented; it shares similarities with how TensorRT would work: https://github.com/outlines-dev/outlines/pull/556/

Specifically `llamacpp.py`: https://github.com/dtiarks/outlines/blob/726ec242fb1695c5a67d489689be13ac84ef472c/outlines/models/llamacpp.py
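As a rough, hedged outline of the shape that integration follows: the backend tokenizer is wrapped so Outlines can compile its FSM over the vocabulary, and generation runs token by token with the processor applied at each step. The names below (`TRTLLMTokenizerAdapter`, `engine.decode_step`) are placeholders for illustration, not existing APIs.

```python
# Hypothetical adapter shape; `engine` and its `decode_step` method are
# placeholders that a real integration would map onto TensorRT-LLM's runtime.
from typing import List


class TRTLLMTokenizerAdapter:
    """Expose the pieces of the backend tokenizer that Outlines' FSM needs."""

    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer
        self.vocabulary = hf_tokenizer.get_vocab()
        self.eos_token_id = hf_tokenizer.eos_token_id

    def convert_token_to_string(self, token: str) -> str:
        # Needed so the FSM can be compiled over the detokenized vocabulary.
        return self.tokenizer.convert_tokens_to_string([token])


def generate(engine, tokenizer: TRTLLMTokenizerAdapter, processor, prompt: str,
             max_tokens: int = 256) -> str:
    """Greedy token-by-token loop that applies a logits processor each step."""
    prompt_ids: List[int] = tokenizer.tokenizer.encode(prompt)
    generated: List[int] = []
    for _ in range(max_tokens):
        logits = engine.decode_step(prompt_ids + generated)  # placeholder call
        logits = processor(generated, logits)
        next_id = int(logits.argmax())
        if next_id == tokenizer.eos_token_id:
            break
        generated.append(next_id)
    return tokenizer.tokenizer.decode(generated)
```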
Please let me know if you have any questions!
Thank you for the resources! I'll definitely get back to you with questions after going through these links.
Thanks!
Related to #655
This can likely be implemented with the Executor API: https://github.com/NVIDIA/TensorRT-LLM/blob/31ac30e928a2db795799fdcab6be446bfa3a3998/examples/cpp/executor/README.md#L4
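A very rough sketch of what that path might look like from Python. The module and member names below are assumptions modeled on the linked C++ executor example and may not match the Python bindings in a given TensorRT-LLM release; verify them against your installed version.

```python
# Assumed names modeled on the linked C++ executor example; check them against
# the Python bindings shipped with your TensorRT-LLM version.
import tensorrt_llm.bindings.executor as trtllm  # assumed module path

executor = trtllm.Executor(
    "/path/to/engine_dir",              # compiled TensorRT-LLM engine
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(),
)

# Enqueue one request; a structured-generation integration would attach the
# logits-processor hook here, if/when the executor exposes one to Python.
request = trtllm.Request(input_token_ids=[1, 2, 3], max_new_tokens=64)
request_id = executor.enqueue_request(request)

# Poll until the request finishes and collect the generated token ids.
output_ids = []
done = False
while not done:
    for response in executor.await_responses(request_id):
        result = response.result
        output_ids.extend(result.output_token_ids[0])  # beam 0
        done = result.is_final
```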
Outlines currently supports the vLLM inference engine; it would be great if it could also support the TensorRT-LLM inference engine.