TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
An engine built with v0.11.0 and served with Triton Server achieves lower concurrency than one built with v0.10.0; the build scripts are identical. #2117
With the v0.11.0 engine, GPU utilization is only 52%–80%, while the v0.10.0 engine holds steady at 99%.
As a result, concurrency with v0.11.0 is only about half that of v0.10.0.
What could be causing this? Has anyone else encountered it?
System Info
OS: Ubuntu 22.04
GPU: A100
Driver: 550.90.07
Who can help?
No response
Reproduction
The following is the v0.11.0 build script:
The v0.10.0 build script is:
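The reporter's actual build commands did not survive in the issue text. For readers unfamiliar with the workflow, a typical `trtllm-build` invocation of the kind used in this version range looks roughly like the sketch below; every path and size here is a placeholder, not the reporter's configuration, and exact flag names should be checked against the release's own documentation, since some options changed between v0.10.0 and v0.11.0.

```shell
# Illustrative sketch only -- paths, sizes, and flag choices are assumptions,
# not the reporter's actual script. Verify flags against the installed release.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_v0.11.0 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048
```

When comparing two releases, it is worth diffing the effective build configuration (the `config.json` written next to the engine) rather than just the command line, since defaults for options such as KV-cache paging can differ between versions.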
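To reproduce the utilization comparison, the GPU can be sampled while load-testing each engine. This is a standard `nvidia-smi` query, nothing TensorRT-LLM-specific:

```shell
# Sample GPU utilization and memory use once per second (Ctrl-C to stop).
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1
```

Logging this alongside the client's request rate makes it easy to see whether the v0.11.0 engine is starved for work (scheduler/serving issue) or simply running kernels less efficiently.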
Expected behavior
Same performance as v0.10.0.
Actual behavior
Slower than v0.10.0.
Additional notes
None.