NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Support for text embedding models #1213

Open SupreethRao99 opened 8 months ago

SupreethRao99 commented 8 months ago

With the popularity of RAG, it would be great if TensorRT-LLM supported text-embedding and re-ranking models from sentence-transformers.
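For context on what such support would cover: sentence-transformers embedding models typically mean-pool the encoder's token embeddings (respecting the attention mask) and L2-normalize the result so cosine similarity reduces to a dot product. A minimal NumPy sketch of that pooling step, using fabricated toy token vectors rather than a real encoder's output, looks like:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then L2-normalize."""
    # token_embeddings: (seq_len, hidden); attention_mask: (seq_len,) of 0/1
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    counts = mask.sum(axis=0).clip(min=1e-9)  # avoid division by zero
    pooled = summed / counts
    # L2-normalize so cosine similarity becomes a plain dot product
    return pooled / np.linalg.norm(pooled)

# Toy example: 4 tokens, 3-dim embeddings; the last position is padding.
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [9.0, 9.0, 9.0]])  # padded position, excluded by the mask
mask = np.array([1, 1, 1, 0])
vec = mean_pool(emb, mask)
print(vec)  # unit-norm sentence vector
```

In an accelerated backend, the encoder forward pass would run inside a TensorRT engine and this pooling/normalization would be fused in or applied as a lightweight post-processing step.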

jasonngap1 commented 7 months ago

+1 to this. It would be great if embedding models could be served on Triton Inference Server.

FernandoDorado commented 1 week ago

Is there any update on this? I'm also interested in this capability.

nv-guomingz commented 1 hour ago

cc @ncomly-nvidia @AdamzNV @laikhtewari for vis