NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM

tensorrtllm backend fails when kv cache is disabled #2443

Open · ShuaiShao93 opened this issue 1 week ago

ShuaiShao93 commented 1 week ago

System Info

x86_64, NVIDIA L4 GPU, Debian 11

Who can help?

No response


Reproduction

To Reproduce

  1. Build a TensorRT-LLM engine with trtllm-build --kv_cache_type=disabled.
  2. Load the model in Triton with batching_strategy:inflight_fused_batching and verbose logging enabled.
  3. Run inference with several parallel sessions (a sketch of these commands follows the list).
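
For concreteness, a minimal sketch of the three steps above. The checkpoint/engine paths, the fill_template.py keys, and the "ensemble" model name follow the usual tensorrtllm_backend example layout and are placeholders, not the exact setup used here:

```bash
# 1. Build the engine with the KV cache disabled
#    (checkpoint and output paths are placeholders).
trtllm-build \
  --checkpoint_dir ./ckpt \
  --output_dir ./engine_dir \
  --kv_cache_type=disabled \
  --max_batch_size 8

# 2. Point the tensorrt_llm model config at the engine, select in-flight
#    fused batching, and start Triton with verbose logging.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
  engine_dir:./engine_dir,batching_strategy:inflight_fused_batching,triton_max_batch_size:8,decoupled_mode:False

tritonserver --model-repository=triton_model_repo --log-verbose=1

# 3. Fire several requests in parallel against the generate endpoint
#    ("ensemble" is the model name in the example repo layout).
for i in $(seq 1 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "Hello", "max_tokens": 16}' &
done
wait
```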

Expected behavior

Triton should work correctly when the KV cache is disabled.

actual behavior

The following error is logged:

model_instance_state.cc:1117] "Failed updating TRT LLM statistics: Internal - Failed to find Max KV cache blocks in metrics."

and the TRT-LLM batch size is always 1, even with parallel requests in flight.
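
One way to inspect what the backend is reporting is Triton's Prometheus metrics endpoint (a sketch assuming the default metrics port 8002; the grep patterns are only illustrative):

```bash
# Dump the backend's batching- and KV-cache-related gauges, if any.
# With --kv_cache_type=disabled, the KV cache block metrics that the
# statistics update complains about are expected to be absent.
curl -s localhost:8002/metrics | grep -iE "kv_cache|batch"
```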

additional notes

N/A