NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to serve multiple TensorRT-LLM models in the same process / server? #984


cody-moveworks commented 7 months ago

Hi there! I'm trying to serve multiple TensorRT-LLM models from Python and I'm wondering what the recommended approach is. I've tried / considered:

Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?

achartier commented 7 months ago

We are working on adding support for multiple models to the Triton backend using MPI processes.

A similar approach could be used to implement support with a GptManager per process.
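For illustration, the process-per-model idea can be approximated from Python today by spawning one worker process per engine and routing requests to it over queues. The sketch below is only an outline under assumptions: the engine and tokenizer paths are hypothetical, and the `ModelRunner` API (`tensorrt_llm.runtime.ModelRunner.from_dir` / `generate`) may differ between TensorRT-LLM releases, so check it against the version you are running.

```python
# Sketch: one worker process per TensorRT-LLM engine, so each model gets its
# own runtime instance (the same idea as one GptManager per process).
# Engine/tokenizer paths are hypothetical; the ModelRunner API may vary by release.
import multiprocessing as mp

from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner


def model_worker(engine_dir, tokenizer_dir, requests, responses):
    """Load one engine in this process and answer prompts from its queue."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    runner = ModelRunner.from_dir(engine_dir=engine_dir)
    while True:
        prompt = requests.get()
        if prompt is None:  # shutdown signal
            break
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.int()
        outputs = runner.generate(
            batch_input_ids=[input_ids[0]],
            max_new_tokens=64,
            end_id=tokenizer.eos_token_id,
            pad_id=tokenizer.eos_token_id,
        )
        # Output layout is (batch, beams, tokens) in the releases I've seen.
        responses.put(tokenizer.decode(outputs[0, 0].tolist(), skip_special_tokens=True))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # spawn before any CUDA initialization
    engines = {
        "model_a": ("/engines/model_a", "/tokenizers/model_a"),  # hypothetical paths
        "model_b": ("/engines/model_b", "/tokenizers/model_b"),
    }
    channels = {}
    for name, (engine_dir, tok_dir) in engines.items():
        # To pin each model to its own GPU, set CUDA_VISIBLE_DEVICES per worker.
        req, resp = mp.Queue(), mp.Queue()
        mp.Process(target=model_worker, args=(engine_dir, tok_dir, req, resp), daemon=True).start()
        channels[name] = (req, resp)

    # The parent (server) process routes each request to the right model.
    channels["model_a"][0].put("Hello!")
    print(channels["model_a"][1].get())
```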

achartier commented 4 months ago

Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
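For reference, launching the server with several TRT-LLM models in one Triton model repository looks roughly like the command below. The repository path and world size are placeholders, and flags other than --multi-model may vary across tensorrtllm_backend versions, so defer to the linked documentation.

```
# Sketch: launch Triton against a model repository containing multiple
# TensorRT-LLM models. Paths are placeholders; verify flags against your
# tensorrtllm_backend version.
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /path/to/triton_model_repo \
    --multi-model
```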

kalradivyanshu commented 1 month ago

@achartier If I understand correctly from:

When using the --multi-model option, the Triton model repository can contain multiple TensorRT-LLM models. When running multiple TensorRT-LLM models, the gpu_device_ids parameter should be specified in the models config.pbtxt configuration files. It is up to you to ensure there is no overlap between allocated GPU IDs.

If I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between the allocated GPU IDs?

achartier commented 1 month ago

Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you will need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
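To make the gpu_device_ids guidance above concrete, each model's config.pbtxt in the Triton repository would pin that model to its own device. The excerpt below uses illustrative values only; kv_cache_free_gpu_mem_fraction is shown as the kind of knob you would lower if you do colocate two models on one GPU, assuming your backend version exposes it.

```
# model_repo/model_a/config.pbtxt (excerpt, illustrative values only)
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }          # model_a gets GPU 0
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.4" }        # shrink the KV cache if sharing a GPU
}

# model_repo/model_b/config.pbtxt would then use e.g. string_value: "1" for
# gpu_device_ids so the allocated GPU IDs do not overlap.
```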

anubhav-agrawal-mu-sigma commented 1 month ago

@achartier Do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?

achartier commented 1 month ago

Yes, see the link to the documentation in my April 16 message.