TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
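For context, the high-level Python API can build and run an engine in just a few lines. The snippet below is only a sketch following the library's quick-start pattern; the model name and sampling values are placeholders and the exact API surface may vary across TensorRT-LLM versions.

```python
# Minimal LLM API sketch (model name and sampling values are placeholders).
from tensorrt_llm import LLM, SamplingParams

# Building/loading the TensorRT engine happens behind this call.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```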
Thank you for creating openai-server.py. It has been very helpful in avoiding the need to use vLLM or other OpenAI-like proxies.
I need to deploy several LLMs and embedding models. After reviewing the code for openai-server.py, I noticed that it currently handles only a single model. I am planning to use Triton for inference and manage GPU utilization through Triton.
How would you recommend managing GPU resources through TensorRT-LLM? Is there a way to use multiple LLMs with openai-server.py? Additionally, is there a planned implementation for an API to support Triton-hosted models?
Hi, thanks for using and helping to improve our library.
Can you please explain how you are going to "manage GPU utilization through Triton"?
TRT-LLM has a runtime argument that controls how much of the free GPU memory is used for the KV cache. I am not quite sure how you plan to manage GPU utilization, but that argument could be one option. We do not currently expose it as an openai server argument, but it would not take much effort to add.
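For reference, here is a rough sketch of how that fraction can be capped through the high-level LLM API; the import path, fraction value, and model path below are examples and may differ depending on your TensorRT-LLM version.

```python
# Sketch: cap KV-cache memory so several models can share one GPU.
# (Import path and values may vary by TensorRT-LLM version.)
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Use at most 40% of the free GPU memory for this model's KV cache,
# leaving headroom for another model on the same device (0.4 is an example).
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
llm = LLM(model="/path/to/model_or_engine", kv_cache_config=kv_cache_config)
```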
For serving multiple models with the openai server, I am afraid we do not have an elegant way to do this at the moment. You can still run multiple server instances in separate processes (controlling their memory fractions carefully) and route each request to the instance serving its model, for example with a small proxy as sketched below.
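As a rough illustration, a thin proxy in front of the server instances could dispatch on the request's `model` field. The ports, model names, and endpoint below are placeholders, and this sketch ignores streaming responses; a production setup would more likely use nginx or another reverse proxy.

```python
# Sketch: route OpenAI-style requests to per-model server instances.
# Assumes two openai-server.py processes already listen on ports 8001 and 8002
# (ports and model names are hypothetical).
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

BACKENDS = {
    "llama-3-8b": "http://localhost:8001",
    "mistral-7b": "http://localhost:8002",
}

app = FastAPI()


@app.post("/v1/chat/completions")
async def route_chat_completions(request: Request):
    payload = await request.json()
    base_url = BACKENDS.get(payload.get("model"))
    if base_url is None:
        return JSONResponse({"error": "unknown model"}, status_code=404)
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{base_url}/v1/chat/completions", json=payload)
    return JSONResponse(resp.json(), status_code=resp.status_code)
```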
"an API to support Triton-hosted models", for this question, do you mean implementing openai api in triton?