System Info
Who can help?
@kaiyux What is a simple way to verify that enable_kv_cache_reuse is working correctly?
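For reference, a minimal sketch of the relevant configuration, assuming the engine is served through the Triton tensorrtllm backend (the parameter name follows the tensorrtllm_backend config.pbtxt convention; the engine also needs paged context FMHA at build time, e.g. trtllm-build --use_paged_context_fmha enable, for reuse to take effect):

```
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
```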
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Steps:
1. Send the same inference request to the server repeatedly and record the client-side time cost of each request (see the client sketch below).
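A minimal sketch of the timing loop used on the client side; the endpoint URL, model name (ensemble), and payload fields are assumptions based on a typical Triton tensorrtllm deployment and may need adjusting:

```python
import time
import requests

# Assumed Triton generate endpoint for the tensorrtllm ensemble model.
URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    # Identical prompt on every request, so the prefix KV blocks
    # are candidates for reuse after the first request.
    "text_input": "Tell me about NVIDIA GPUs.",
    "max_tokens": 128,
    "temperature": 0.0,
}

for i in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload)
    resp.raise_for_status()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Same format as the measurements reported below.
    print(f"{i}:client infer cost {elapsed_ms:.0f}")
```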
Expected behavior
When the client sends the same request to the server multiple times, the time cost of subsequent requests should decrease.
Actual behavior
There is almost no change in the time cost across requests:
```
4:client infer cost 10279
9:client infer cost 10318
14:client infer cost 10339
19:client infer cost 10330
24:client infer cost 10329
29:client infer cost 10367
34:client infer cost 10362
39:client infer cost 10343
44:client infer cost 10389
49:client infer cost 10367
54:client infer cost 10367
59:client infer cost 10366
```
Additional notes
None.