NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Increase chunk size while streaming #1623

Open · avianion opened 4 months ago

avianion commented 4 months ago

Is it possible to increase the number of tokens sent per chunk during streaming, and if so, how?

This could also apply when serving via triton-inference-server.
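Neither the question nor the replies below confirm a server-side knob for this, so one common workaround is purely client-side: buffer the per-token stream events and only emit output once a desired chunk size is reached. A minimal sketch, assuming a generic iterator of streamed token strings (`token_stream` and `rechunk` are hypothetical names for illustration, not TensorRT-LLM or Triton APIs):

```python
from typing import Iterable, Iterator

def rechunk(token_stream: Iterable[str], chunk_size: int = 8) -> Iterator[str]:
    """Buffer single-token stream events into larger chunks.

    `token_stream` stands in for whatever per-token iterator the client
    receives (e.g. TensorRT-LLM streaming responses or Triton gRPC
    stream callbacks); it is a placeholder, not a library API.
    """
    buffer: list[str] = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield "".join(buffer)  # emit one larger chunk
            buffer.clear()
    if buffer:  # flush whatever remains at end-of-stream
        yield "".join(buffer)

# Usage:
# for chunk in rechunk(my_stream, chunk_size=16):
#     print(chunk, end="", flush=True)
```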

byshiue commented 4 months ago

I am a little confused by your question. Do you want to receive more tokens per response in streaming mode? (Since you say "chunk size", I want to make sure this is not about the chunked-context feature.)
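For context on the distinction byshiue is drawing: chunked context splits the prefill of a long prompt into pieces so context and generation requests can be interleaved by the scheduler; it does not change how many tokens each streaming response carries. A hedged sketch of enabling it, assuming the Python executor bindings of the TensorRT-LLM version current at the time expose `ExecutorConfig.enable_chunked_context` (verify against your installed version; the engine path is a placeholder):

```python
from tensorrt_llm.bindings import executor as trtllm

# Chunked context affects long-prompt prefill scheduling; it is
# unrelated to the number of tokens returned per streaming response.
config = trtllm.ExecutorConfig(max_beam_width=1)
config.enable_chunked_context = True  # assumed field; check your version

executor = trtllm.Executor(
    "/path/to/engine_dir",           # placeholder engine directory
    trtllm.ModelType.DECODER_ONLY,
    config,
)
```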

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.