System Info
x86_64, NVIDIA L4 GPU, Debian 11 OS
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
To Reproduce

1. Build the engine with the KV cache disabled: `trtllm-build --kv_cache_type=disabled`
2. Configure the Triton model with `batching_strategy:inflight_fused_batching` and enable verbose logging (a sketch of the full setup is below).
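For reference, a minimal sketch of the setup. The checkpoint/engine paths, the `max_batch_size` value, and the model-repository layout are assumptions for illustration, not taken from this report:

```sh
# Sketch only: paths and max_batch_size are assumed, not from the report.

# 1. Build the engine with the KV cache disabled.
trtllm-build \
    --checkpoint_dir ./ckpt \
    --output_dir ./engine \
    --kv_cache_type=disabled \
    --max_batch_size 8

# 2. Fill in the Triton tensorrt_llm model config (tensorrtllm_backend).
#    The remaining ${...} placeholders in config.pbtxt must be filled as well.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "engine_dir:./engine,batching_strategy:inflight_fused_batching"

# 3. Launch Triton with verbose logging.
tritonserver --model-repository=triton_model_repo --log-verbose=1
```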
Expected behavior
Triton should work correctly when the KV cache is disabled.
actual behavior
An error is raised, and the TensorRT-LLM batch size is always 1.
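One way to narrow this down (a hypothetical check, assuming the engine was written to `./engine` as in the sketch above and that `trtllm-build` emitted a `config.json` with a `build_config` section) is to confirm the `max_batch_size` recorded in the generated engine config:

```sh
# Assumption: ./engine/config.json exists and contains build_config.max_batch_size.
python3 -c "import json; print(json.load(open('./engine/config.json'))['build_config']['max_batch_size'])"
```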
additional notes
N/A