I'm testing GPTQ on 2×A30 GPUs and have found something strange.
When I build the model with max_batch_size = 64, it runs fine with batch = 64, input = 32, output = 96.
But when I build the model with max_batch_size = 128, it fails to run with batch = 64, input = 32, output = 96 (the error message is below). Why is that?
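To make the comparison concrete, the sketch below shows the two builds side by side (hypothetical paths and output dirs; flag names follow the examples/llama build script in TensorRT-LLM and may vary across releases):

```bash
# Hedged sketch of the two configurations being compared. <hf_model> and
# <gptq_ckpt> are placeholders, not my real paths; flag names follow the
# examples/llama build script and may differ between TensorRT-LLM versions.

# Case 1 (works): engine built with max_batch_size = 64,
# then run with batch = 64, input = 32, output = 96
python build.py --model_dir <hf_model> --quant_ckpt_path <gptq_ckpt> \
    --use_weight_only --weight_only_precision int4_gptq --per_group \
    --world_size 2 \
    --max_batch_size 64 --max_input_len 32 --max_output_len 96 \
    --output_dir ./engine_bs64

# Case 2 (fails at runtime): identical build except max_batch_size = 128,
# run with the same batch = 64, input = 32, output = 96
python build.py --model_dir <hf_model> --quant_ckpt_path <gptq_ckpt> \
    --use_weight_only --weight_only_precision int4_gptq --per_group \
    --world_size 2 \
    --max_batch_size 128 --max_input_len 32 --max_output_len 96 \
    --output_dir ./engine_bs128
```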
Here is my build command: