System Info

Issue
Hi, I am using the InternVL2-20B model, whose decoder is InternLM2. I have successfully converted the decoder to a TensorRT engine, and the model runs without any issues with the command I originally used.
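A minimal sketch of that convert-and-build flow, assuming the standard TensorRT-LLM InternLM2 example; the paths, dtype, and size limits below are placeholders rather than the exact values used:

```bash
# Convert the InternLM2 decoder weights into a TensorRT-LLM checkpoint
# (placeholder paths; assumes the standard examples/internlm2 layout).
python examples/internlm2/convert_checkpoint.py \
    --model_dir ./InternVL2-20B/decoder \
    --output_dir ./internlm2_ckpt \
    --dtype float16

# Build the TensorRT engine; note that max_batch_size is fixed at build time.
trtllm-build \
    --checkpoint_dir ./internlm2_ckpt \
    --output_dir ./internlm2_engine \
    --gemm_plugin float16 \
    --max_batch_size 16 \
    --max_input_len 2048 \
    --max_seq_len 4096
```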
However, I then change the `max_batch_size` from 16 to 32, 64, or 128, and use the Triton Inference Server to serve the model.
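A sketch of those two steps under the same placeholder paths; the launcher script is the standard one from the tensorrtllm_backend repository, and the model-repository path is an assumption:

```bash
# Rebuild the engine with a larger build-time batch limit (placeholder values).
trtllm-build \
    --checkpoint_dir ./internlm2_ckpt \
    --output_dir ./internlm2_engine_bs32 \
    --gemm_plugin float16 \
    --max_batch_size 32 \
    --max_input_len 2048 \
    --max_seq_len 4096

# Launch Triton with the TensorRT-LLM backend model repository.
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo ./triton_model_repo \
    --world_size 1
```

Note that the batch limit lives in two places: the engine's build-time `--max_batch_size` and the `max_batch_size` declared in the Triton model's `config.pbtxt`; a request batch larger than the engine's build-time limit cannot be executed by the engine.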
When the batch size is greater than 16, I encounter the following error: