NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Device CUDA update caused model to stop running #2455

Open · chrisreese-if opened this issue 3 days ago

chrisreese-if commented 3 days ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

We built our model using the tritonserver 24.10 container; the base server's CUDA was at 12.4. We are using a cloud provider for our GPU infrastructure, and they updated their CUDA version to 12.7, after which our TRT-built model stopped working (the error reported a CUDA mismatch).

But we are still using tritonserver 24.10, so it shouldn't matter, right? If we run Triton 24.10 in compatibility mode with CUDA 12.4, 12.5, and 12.7, do we need three different TRT builds? And a fourth for 12.6? Is Triton really that sensitive to the CUDA version?
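
For context, this is roughly the kind of check we can run inside the container to compare the versions involved. It is only a sketch: the `CUDA Version` parsing from `nvidia-smi` output and the `CUDA_VERSION` environment variable are assumptions about what the NGC image exposes.

```python
# Rough version check inside the Triton container (sketch only; the regex and
# the CUDA_VERSION env var are assumptions about the NGC image).
import os
import re
import subprocess

import tensorrt as trt

# Driver-side CUDA version, i.e. what the host exposes after the provider's update.
smi_out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", smi_out)
print("driver CUDA   :", match.group(1) if match else "unknown")

# CUDA toolkit baked into the tritonserver 24.10 image (if the env var is set).
print("container CUDA:", os.environ.get("CUDA_VERSION", "unknown"))

# TensorRT shipped in the container; the engine must match this TensorRT version.
print("TensorRT      :", trt.__version__)
```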

Expected behavior

If the Triton container version is the same and the GPU configuration is the same, it should just work.

Actual behavior

launch_triton_server fails.

Additional notes

Triton server container version: 24.10
GPU: H100 SXM 80GB
Base server CUDA during build: 12.4
Server CUDA after update: 12.7
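
One way to isolate Triton from the engine/driver question is to deserialize the prebuilt engine directly with the TensorRT Python API under the updated driver. A minimal sketch follows; the engine path is hypothetical and should be adjusted to the actual model repository layout.

```python
# Try to deserialize the prebuilt engine under the current driver stack,
# outside of Triton, to see whether the engine itself still loads.
import tensorrt as trt

ENGINE_PATH = "/models/tensorrt_llm/1/rank0.engine"  # hypothetical path, adjust as needed

logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)

with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

print("engine deserialized OK" if engine is not None else "engine failed to deserialize")
```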