TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
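For context, the high-level Python API can build and run an engine in just a few lines. The snippet below is only a sketch following the library's quick-start pattern; the model name and sampling values are placeholders and the exact API surface may vary across TensorRT-LLM versions.

```python
# Minimal LLM API sketch (model name and sampling values are placeholders).
from tensorrt_llm import LLM, SamplingParams

# Building/loading the TensorRT engine happens behind this call.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```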
Thank you for creating openai-server.py. It has been very helpful in avoiding the need to use vLLM or other OpenAI-like proxies.
I need to deploy several LLMs and embedding models. After reviewing the code for openai-server.py, I noticed that it currently handles only a single model. I am planning to use Triton for inference and manage GPU utilization through Triton.
How would you recommend managing GPU resources through TensorRT-LLM? Is there a way to use multiple LLMs with openai-server.py? Additionally, is there a planned implementation for an API to support Triton-hosted models?
Hi, thanks for using and helping to improve our library.
Can you please explain how you are going to "manage GPU utilization through Triton"?
TRT-LLM has a runtime argument that controls how much of the free GPU memory is used for the KV cache. I am not quite sure how you plan to manage GPU utilization, but that argument could be one option. We do not currently expose it as an openai server argument, but it would not take much effort to add.
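For reference, here is a rough sketch of how that fraction can be capped through the high-level LLM API; the import path, fraction value, and model path below are examples and may differ depending on your TensorRT-LLM version.

```python
# Sketch: cap KV-cache memory so several models can share one GPU.
# (Import path and values may vary by TensorRT-LLM version.)
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Use at most 40% of the free GPU memory for this model's KV cache,
# leaving headroom for another model on the same device (0.4 is an example).
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
llm = LLM(model="/path/to/model_or_engine", kv_cache_config=kv_cache_config)
```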
For serving multiple models with the openai server, I am afraid we do not have an elegant way to do this at the moment. You can still run multiple server instances in separate processes (controlling their memory fractions carefully) and route each request to the instance serving its model, for example with a small proxy as sketched below.
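As a rough illustration, a thin proxy in front of the server instances could dispatch on the request's `model` field. The ports, model names, and endpoint below are placeholders, and this sketch ignores streaming responses; a production setup would more likely use nginx or another reverse proxy.

```python
# Sketch: route OpenAI-style requests to per-model server instances.
# Assumes two openai-server.py processes already listen on ports 8001 and 8002
# (ports and model names are hypothetical).
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

BACKENDS = {
    "llama-3-8b": "http://localhost:8001",
    "mistral-7b": "http://localhost:8002",
}

app = FastAPI()


@app.post("/v1/chat/completions")
async def route_chat_completions(request: Request):
    payload = await request.json()
    base_url = BACKENDS.get(payload.get("model"))
    if base_url is None:
        return JSONResponse({"error": "unknown model"}, status_code=404)
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{base_url}/v1/chat/completions", json=payload)
    return JSONResponse(resp.json(), status_code=resp.status_code)
```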
"an API to support Triton-hosted models", for this question, do you mean implementing openai api in triton?