TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
We built our model using the tritonserver container 24.10; the base server's CUDA was at 12.4. We use a cloud provider for our GPU infrastructure, and after they updated their CUDA version to 12.7, our TRT-built model stopped working (the error said CUDA mismatch).

But we are still using tritonserver 24.10, so this shouldn't matter, should it? If we use Triton 24.10 in compatibility mode with CUDA 12.4, 12.5, and 12.7, do we need three different TRT builds? And a fourth for 12.6? Is Triton really that sensitive to the CUDA version?
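One thing worth separating here: the CUDA toolkit baked into the container (which the engine was built against) and the CUDA level reported by the host driver are two different version numbers. A minimal check, assuming torch and tensorrt are importable inside the 24.10 container used for the build, prints both side by side:

```python
# Minimal sketch, assuming torch and tensorrt are importable inside the
# tritonserver 24.10 container used for the build.
import subprocess

import torch
import tensorrt as trt

# Toolkit the container (and hence the engine build) is pinned to.
print("Container CUDA toolkit:", torch.version.cuda)
print("TensorRT version:", trt.__version__)

# The "CUDA Version" line in nvidia-smi is the driver's maximum supported
# CUDA level (e.g. the 12.7 after the provider's update); it does not have
# to match the container toolkit above.
print(subprocess.check_output(["nvidia-smi"], text=True))
```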
Expected behavior
If the Triton container version is the same and the GPU configuration is the same, it should just work.
Actual behavior
launch_triton_server fails.
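To surface the actual mismatch message rather than the launcher's generic failure, the engine can be deserialized directly with the TensorRT Python runtime. A minimal sketch, where model.plan is a placeholder for the real engine file:

```python
# Minimal sketch: deserialize the engine outside Triton so the TensorRT
# logger prints the exact version-mismatch details. "model.plan" is a
# placeholder for the real engine path.
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)

with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# deserialize_cuda_engine returns None on failure; the verbose logger
# output names the versions the plan was serialized with.
print("deserialized OK" if engine is not None else "deserialization failed")
```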
Additional notes
Triton server container version: 24.10
GPU: H100 SXM 80GB
Base server CUDA during build: 12.4
Server CUDA after update: 12.7