NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to serve multiple TensorRT-LLM models in the same process / server? #984


cody-moveworks commented 7 months ago

Hi there! I'm trying to serve multiple TensorRT-LLM models from Python and I'm wondering what the recommended approach is. I've tried / considered:

Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?

achartier commented 7 months ago

We are working on adding support for multiple models to the Triton backend using MPI processes.

A similar approach could be used to implement support with a GptManager per process.
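For illustration, the process-per-model idea can be approximated from Python today by spawning one worker process per engine and routing requests to it over queues. The sketch below is only an outline under assumptions: the engine and tokenizer paths are hypothetical, and the `ModelRunner` API (`tensorrt_llm.runtime.ModelRunner.from_dir` / `generate`) may differ between TensorRT-LLM releases, so check it against the version you are running.

```python
# Sketch: one worker process per TensorRT-LLM engine, so each model gets its
# own runtime instance (the same idea as one GptManager per process).
# Engine/tokenizer paths are hypothetical; the ModelRunner API may vary by release.
import multiprocessing as mp

from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner


def model_worker(engine_dir, tokenizer_dir, requests, responses):
    """Load one engine in this process and answer prompts from its queue."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    runner = ModelRunner.from_dir(engine_dir=engine_dir)
    while True:
        prompt = requests.get()
        if prompt is None:  # shutdown signal
            break
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.int()
        outputs = runner.generate(
            batch_input_ids=[input_ids[0]],
            max_new_tokens=64,
            end_id=tokenizer.eos_token_id,
            pad_id=tokenizer.eos_token_id,
        )
        # Output layout is (batch, beams, tokens) in the releases I've seen.
        responses.put(tokenizer.decode(outputs[0, 0].tolist(), skip_special_tokens=True))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # spawn before any CUDA initialization
    engines = {
        "model_a": ("/engines/model_a", "/tokenizers/model_a"),  # hypothetical paths
        "model_b": ("/engines/model_b", "/tokenizers/model_b"),
    }
    channels = {}
    for name, (engine_dir, tok_dir) in engines.items():
        # To pin each model to its own GPU, set CUDA_VISIBLE_DEVICES per worker.
        req, resp = mp.Queue(), mp.Queue()
        mp.Process(target=model_worker, args=(engine_dir, tok_dir, req, resp), daemon=True).start()
        channels[name] = (req, resp)

    # The parent (server) process routes each request to the right model.
    channels["model_a"][0].put("Hello!")
    print(channels["model_a"][1].get())
```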

achartier commented 4 months ago

Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
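For reference, launching the server with several TRT-LLM models in one Triton model repository looks roughly like the command below. The repository path and world size are placeholders, and flags other than --multi-model may vary across tensorrtllm_backend versions, so defer to the linked documentation.

```
# Sketch: launch Triton against a model repository containing multiple
# TensorRT-LLM models. Paths are placeholders; verify flags against your
# tensorrtllm_backend version.
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /path/to/triton_model_repo \
    --multi-model
```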

kalradivyanshu commented 1 month ago

@achartier If I understand correctly from:

When using the --multi-model option, the Triton model repository can contain multiple TensorRT-LLM models. When running multiple TensorRT-LLM models, the gpu_device_ids parameter should be specified in the models config.pbtxt configuration files. It is up to you to ensure there is no overlap between allocated GPU IDs.

If I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between the allocated GPU IDs?

achartier commented 1 month ago

Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you will need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
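To make the gpu_device_ids guidance above concrete, each model's config.pbtxt in the Triton repository would pin that model to its own device. The excerpt below uses illustrative values only; kv_cache_free_gpu_mem_fraction is shown as the kind of knob you would lower if you do colocate two models on one GPU, assuming your backend version exposes it.

```
# model_repo/model_a/config.pbtxt (excerpt, illustrative values only)
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }          # model_a gets GPU 0
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.4" }        # shrink the KV cache if sharing a GPU
}

# model_repo/model_b/config.pbtxt would then use e.g. string_value: "1" for
# gpu_device_ids so the allocated GPU IDs do not overlap.
```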

anubhav-agrawal-mu-sigma commented 1 month ago

@achartier Do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?

achartier commented 1 month ago

Yes, see the link to the documentation in my April 16 message.