NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Attention sink #104

Closed jqueguiner closed 9 months ago

jqueguiner commented 11 months ago

Hi 👋 and thanks for the amazing job can’t wait to see the developments in the next few weeks and months.

Any plans to work on attention sinks?

jdemouth-nvidia commented 11 months ago

Hi @jqueguiner ,

Thanks for your support. We are considering adding that feature to TensorRT-LLM, but nothing is concrete at this point. We are not ready to commit to a date when it will be added (if ever).

Thanks, Julien

ncomly-nvidia commented 9 months ago

Hi @jqueguiner . StreamingLLM, a technique which takes advantage of Attention Sinks, has been added to the main branch!

Llama example. Take a look & let us know what you think!
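For readers unfamiliar with the technique: StreamingLLM keeps a few initial "attention sink" tokens in the KV cache permanently, plus a sliding window of the most recent tokens, evicting everything in between. Below is a minimal illustrative sketch of that cache policy (hypothetical class and parameter names; this is not TensorRT-LLM's actual implementation):

```python
# Sketch of StreamingLLM-style KV-cache eviction: retain a handful of
# initial "sink" tokens plus a sliding window of recent tokens.
# Sizes here are illustrative assumptions, not TensorRT-LLM defaults.

from collections import deque

class SinkKVCache:
    def __init__(self, num_sink_tokens=4, window_size=8):
        self.num_sink = num_sink_tokens
        self.sink = []                            # first tokens, never evicted
        self.recent = deque(maxlen=window_size)   # sliding window of recent tokens

    def append(self, token):
        if len(self.sink) < self.num_sink:
            self.sink.append(token)
        else:
            self.recent.append(token)             # deque drops the oldest entry

    def visible_tokens(self):
        # Tokens whose K/V entries attention can still attend to.
        return self.sink + list(self.recent)

cache = SinkKVCache(num_sink_tokens=2, window_size=3)
for t in range(10):                               # feed token ids 0..9
    cache.append(t)
print(cache.visible_tokens())                     # -> [0, 1, 7, 8, 9]
```

The key observation behind the technique is that attention scores concentrate heavily on the first few tokens, so evicting them degrades generation quality even when they carry little semantic content.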

zhyncs commented 5 months ago

> Hi @jqueguiner . StreamingLLM, a technique which takes advantage of Attention Sinks, has been added to the main branch!
>
> Llama example. Take a look & let us know what you think!

Hi @ncomly-nvidia. Is H2O supported in the latest main branch? Thanks.