Closed DefTruth closed 8 months ago
Solved. I rewrote the attention to match the pattern below and used TRT 9.2; after that, the mha_v2 kernel is used.
Specifically, for Q, K, and V:

```
Q: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] -> MatMul -> [B, N, S, S] -> MatMul -> [B, N, S, h] -Transpose-> [B, S, N, h] -Reshape-> [B, S, H] -LayerNorm-> ...
K: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, h, S] ---^                ^
V: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] --------------------
```
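For reference, the shape flow above can be sketched in NumPy. This is only an illustration of the pattern, not the original model: the sizes (B, S, N, h) and weight names (Wq, Wk, Wv) are made up, and the scale/softmax between the two MatMuls is elided just as it is in the diagram.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the issue):
# batch B, sequence length S, N heads of head dim h, hidden H = N * h.
B, S, N, h = 2, 16, 4, 8
H = N * h

rng = np.random.default_rng(0)
x = rng.standard_normal((B, S, H), dtype=np.float32)
Wq, Wk, Wv = (rng.standard_normal((H, H), dtype=np.float32) for _ in range(3))

# Q: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h]
q = (x @ Wq).reshape(B, S, N, h).transpose(0, 2, 1, 3)
# K: same, but transposed to [B, N, h, S] so that Q @ K is a plain MatMul
k = (x @ Wk).reshape(B, S, N, h).transpose(0, 2, 3, 1)
# V: [B, S, H] -> ... -> [B, N, S, h]
v = (x @ Wv).reshape(B, S, N, h).transpose(0, 2, 1, 3)

# MatMul -> [B, N, S, S]; the scale/softmax between the two MatMuls is
# elided here (as in the diagram) -- it does not change any shape.
scores = q @ k
# MatMul -> [B, N, S, h] -Transpose-> [B, S, N, h] -Reshape-> [B, S, H]
out = (scores @ v).transpose(0, 2, 1, 3).reshape(B, S, H)

print(out.shape)  # (2, 16, 32), i.e. [B, S, H]
```

The key detail for the fusion pattern is that K is transposed to [B, N, h, S] at export time, so both attention MatMuls appear as plain batched matrix multiplies with no extra transpose in between.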
Description
I want to figure out whether the attention fused by Myelin runs on an MHA kernel, but the nsys results show that only the xmma_gemm kernel is used. How can I make TensorRT use the MHA/FMHA kernel manually? Are there any docs that can help? Many thanks!
nsys profile results:
ONNX vs Layers after Myelin optimization
Environment
TensorRT Version: 9.2
NVIDIA GPU: A30 / 3080
NVIDIA Driver Version: 525
CUDA Version: 12.2
CUDNN Version: 8.9
Operating System: Linux
Python Version (if applicable): 3.10
Tensorflow Version (if applicable): none
PyTorch Version (if applicable): 2.1.2
Baremetal or Container (if so, version): none
Relevant Files
Model link: none
Steps To Reproduce
To reproduce, please check the blog: