NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

is it possible to use only attention from tensorrt_llm/layers/attention.py #2477

Closed mdfaheem786 closed 1 day ago

mdfaheem786 commented 1 day ago

hi team,

I need some clarification. This Attention layer uses flash attention (a faster implementation) and can be used from both PyTorch and TensorRT (TRT), but I want to use only the Attention class in my own flow. If that is possible, please guide me through the steps. While trying to understand the flow, I also noticed that it depends on the network (default_net(), which needs to be set).
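
Roughly, this is what I mean (a minimal sketch only; the `Attention` constructor arguments below are placeholders from my reading of `layers/attention.py` and may not match the current release):

```python
import tensorrt_llm
from tensorrt_llm.layers import Attention

builder = tensorrt_llm.Builder()
network = builder.create_network()

# Layers record their ops into the "current" network, which is why
# default_net() has to be set; net_guard() sets it for this with-block.
with tensorrt_llm.net_guard(network):
    assert tensorrt_llm.default_net() is not None

    # Illustrative arguments only: the exact signature changes between
    # releases (some require local_layer_idx, older ones do not), so
    # please check tensorrt_llm/layers/attention.py in your checkout.
    attn = Attention(local_layer_idx=0,
                     hidden_size=1024,
                     num_attention_heads=16,
                     max_position_embeddings=2048)
    # ...then call attn(hidden_states, ...) once the input tensors and
    # attention/KV-cache parameters are defined inside this scope.
```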

Also, please correct me if I am wrong: FMHA is related to flash attention, right?

Thank you.

PerkzZheng commented 1 day ago

You can take a look at the GPT attention plugin to see how the fmhaRunner is used: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp#L1983-L2026 and https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp#L1320-L1364.

mdfaheem786 commented 1 day ago

Thank you so much for your fast reply, @PerkzZheng.

Sorry to ask for more, but can I get an equivalent in Python?

PerkzZheng commented 1 day ago

> Thank you so much for your fast reply, @PerkzZheng.
>
> Sorry to ask for more, but can I get an equivalent in Python?

You can refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/attention/test_gpt_attention.py (based on the TRT plugins), or you may need to wrap everything through PyTorch ops or Python bindings.
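
If you only need the attention computation itself (the "wrap everything through PyTorch ops" route), a rough sketch is PyTorch's built-in `scaled_dot_product_attention`, which dispatches to fused flash-attention kernels when the dtype and hardware allow. Note this is not TRT-LLM's `Attention` class, just the equivalent math:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: [batch, num_heads, seq_len, head_dim]
    # On recent PyTorch/GPU combinations this routes to a fused
    # flash-attention kernel; otherwise it falls back to a math path.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32
q = torch.randn(1, 16, 128, 64, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = attention(q, k, v)   # shape: [1, 16, 128, 64]
```

The TRT-LLM fused kernels (the fmhaRunner mentioned above) are invoked from the GPT attention plugin, which is what the test linked above exercises through a built engine.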

mdfaheem786 commented 1 day ago

Thank you for guiding me, @PerkzZheng.

PerkzZheng commented 1 day ago

> Thank you for guiding me, @PerkzZheng.

No problem.