Closed DefTruth closed 8 months ago
Solved. I rewrote the attention to match the pattern below and used TRT 9.2; after that, the mha_v2 kernel is used.
Specifically, for Q, K, and V:

```
Q: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] -> MatMul -> [B, N, S, S] -> MatMul -> [B, N, S, h] -Transpose-> [B, S, N, h] -Reshape-> [B, S, H] -LayerNorm-> ...
K: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, h, S] ---^                ^
V: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] --------------------
```
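For reference, the shape flow above can be sketched in NumPy. This is only an illustration of the pattern, not the original model: the sizes (B, S, N, h) and weight names (Wq, Wk, Wv) are made up, and the scale/softmax between the two MatMuls is elided just as it is in the diagram.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the issue):
# batch B, sequence length S, N heads of head dim h, hidden H = N * h.
B, S, N, h = 2, 16, 4, 8
H = N * h

rng = np.random.default_rng(0)
x = rng.standard_normal((B, S, H), dtype=np.float32)
Wq, Wk, Wv = (rng.standard_normal((H, H), dtype=np.float32) for _ in range(3))

# Q: [B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h]
q = (x @ Wq).reshape(B, S, N, h).transpose(0, 2, 1, 3)
# K: same, but transposed to [B, N, h, S] so that Q @ K is a plain MatMul
k = (x @ Wk).reshape(B, S, N, h).transpose(0, 2, 3, 1)
# V: [B, S, H] -> ... -> [B, N, S, h]
v = (x @ Wv).reshape(B, S, N, h).transpose(0, 2, 1, 3)

# MatMul -> [B, N, S, S]; the scale/softmax between the two MatMuls is
# elided here (as in the diagram) -- it does not change any shape.
scores = q @ k
# MatMul -> [B, N, S, h] -Transpose-> [B, S, N, h] -Reshape-> [B, S, H]
out = (scores @ v).transpose(0, 2, 1, 3).reshape(B, S, H)

print(out.shape)  # (2, 16, 32), i.e. [B, S, H]
```

The key detail for the fusion pattern is that K is transposed to [B, N, h, S] at export time, so both attention MatMuls appear as plain batched matrix multiplies with no extra transpose in between.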
Description
I want to figure out whether the attention fused by Myelin runs on an MHA kernel, but the nsys results show that only the xmma_gemm kernel is used. How can I make TensorRT use the MHA/FMHA kernel manually? Are there any docs that can help? Many thanks!
nsys profile results:
ONNX vs Layers after Myelin optimization
Environment
TensorRT Version: 9.2
NVIDIA GPU: A30 / 3080
NVIDIA Driver Version: 525
CUDA Version: 12.2
CUDNN Version: 8.9
Operating System: Linux
Python Version (if applicable): 3.10
Tensorflow Version (if applicable): none
PyTorch Version (if applicable): 2.1.2
Baremetal or Container (if so, version): none
Relevant Files
Model link: none
Steps To Reproduce
To reproduce, please check the blog: