System Info

- GPU: NVIDIA L20
- CUDA: 12.3
- TensorRT-LLM: 0.9.0.dev2024032600
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
After building the engine with max_batch_size=4, I executed the command:
LLaVA 13B inference with batch_size=2 fails with an error.
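For reference, a minimal sketch of the kind of build/run workflow involved, following the examples/multimodal flow in the TensorRT-LLM repository. Paths, model directories, and some flag values here are placeholders; exact script names and options may differ between versions.

```shell
# Build the LLM engine with max_batch_size=4 (sketch; --checkpoint_dir and
# --output_dir paths are placeholders)
trtllm-build \
    --checkpoint_dir ./llava-13b-checkpoint \
    --output_dir ./llava-13b-engine \
    --max_batch_size 4

# Run inference with a batch size of 2, which is within the built
# max_batch_size=4 and should be accepted by the engine
python examples/multimodal/run.py \
    --engine_dir ./llava-13b-engine \
    --batch_size 2
```

The expectation is that any runtime batch size up to the `--max_batch_size` used at build time is valid; the reported bug is that batch_size=2 fails even though the engine was built with max_batch_size=4.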
Expected behavior

Inference with batch_size=2 should succeed, since the engine was built with max_batch_size=4.
actual behavior

An error is raised during batch_size=2 inference.
additional notes

None.