You can take a look at the GPT attention plugins to see how the fmhaRunner is used:
see https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp#L1983-L2026
and https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp#L1320-L1364.
Thank you so much for your fast reply, @PerkzZheng.
Sorry to ask for more, but can I get an equivalent in Python?
You can refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/attention/test_gpt_attention.py (based on the TRT plugins), or you may need to wrap everything through PyTorch ops or Python bindings.
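For example, the plugin path (and with it the fmhaRunner) is enabled from Python roughly like this. This is a minimal sketch, not a full build script; `set_gpt_attention_plugin`, `set_context_fmha`, and `ContextFMHAType` are the names used by the Python API at the time of writing and may change between releases:

```python
# Minimal sketch, assuming the current TensorRT-LLM Python API: enable the
# GPT attention plugin (which drives the fmhaRunner) and turn on the
# context-phase fused multi-head attention (flash-attention style) kernels.
from tensorrt_llm import Builder
from tensorrt_llm.plugin.plugin import ContextFMHAType

builder = Builder()
network = builder.create_network()

# Route attention through the GPT attention plugin instead of plain TRT layers.
network.plugin_config.set_gpt_attention_plugin(dtype='float16')
# Enable the fused MHA kernels for the context phase.
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
```

The attention op itself is then added inside a `net_guard(network)` scope via `tensorrt_llm.functional.gpt_attention(...)`; it takes many KV-cache bookkeeping tensors, so the test linked above is the best reference for a complete call.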
Thank you for guiding me, @PerkzZheng.
No problem.
Hi team,
I need some clarification. This attention uses flash attention (a faster implementation), which can be used in PyTorch and TensorRT (TRT). But I want to use only the attention class in my flow; if that is possible, please guide me through the steps. It also depends on the network (default_net(), which needs to be set), which I learned while trying to understand the flow. A sketch of what I mean is below.
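For concreteness, here is a minimal sketch of the dependency I mean, assuming the TensorRT-LLM Python API (`net_guard` and `default_net` are the names I found in the source; `relu` stands in for `gpt_attention` here):

```python
# Minimal sketch of the network dependency: functional ops call default_net()
# internally, so they must be built inside a net_guard() scope that installs
# the active network. Names follow the TensorRT-LLM Python API as I read it.
import tensorrt as trt
from tensorrt_llm import Builder
from tensorrt_llm.functional import Tensor, relu
from tensorrt_llm.network import net_guard

builder = Builder()
network = builder.create_network()

with net_guard(network):  # makes `network` the default_net()
    x = Tensor(name='x', dtype=trt.float16, shape=(1, 16))
    y = relu(x)  # any functional op; gpt_attention is wired the same way
    y.mark_output('y', trt.float16)
```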
Please correct me if I'm wrong: fmha is also related to flash attention.
Thank you.