NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Does tensorrt_llm support a parameter like past_key_values? #1867

Open GooVincent opened 4 days ago

GooVincent commented 4 days ago

Is any parameter similar to the Hugging Face transformers past_key_values supported in tensorrt_llm? That would make it possible to compute the KV cache in advance and then pass it to ModelRunner.generate() or ModelRunnerCpp.generate(), which would speed up decoding.
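
For reference, this is a minimal sketch of the Hugging Face transformers pattern the question describes: run one forward pass over a shared prefix to fill the KV cache, then reuse that cache when decoding new tokens. The model name and prompt here are placeholders, not anything from TensorRT-LLM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from transformers works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prefix_ids = tokenizer("A shared system prompt.", return_tensors="pt").input_ids

with torch.no_grad():
    # One forward pass over the prefix populates the KV cache.
    out = model(prefix_ids, use_cache=True)
    past_key_values = out.past_key_values  # precomputed cache

# Decode the next token: only the new token is fed through the model;
# the cached keys/values already cover the prefix.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(next_token, past_key_values=past_key_values, use_cache=True)
```

The question is whether ModelRunner.generate() or ModelRunnerCpp.generate() accepts an equivalent precomputed-cache input so the prefix cost can be paid once up front.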