NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

No MHA (multi-head attention) kernel is called in TensorRT 10.3 on Orin with JetPack 6.1 #4167

Open steventu27 opened 4 hours ago

steventu27 commented 4 hours ago

Description

I use the exact ONNX file attention_ln_opset13.onnx from https://github.com/NVIDIA/TensorRT/issues/3575#issuecomment-1874776406.

The attention structure is as shown below: [attention graph image]

When I build with TensorRT 10.3 on JetPack 6.1 using the command trtexec --onnx=attention_ln_opset13.onnx --int8 --saveEngine=default_int8.engine, the nsys profile below shows that no MHA kernel is used.

[nsys profile screenshot]

nsys_engine.zip:
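As a cross-check on the nsys trace, the layers and tactics the builder actually chose can also be read back from the engine itself. Below is a minimal sketch, assuming the TensorRT Python bindings shipped with JetPack 6.1; the engine file name matches the trtexec command above.

```python
# Minimal sketch: inspect which layers/kernels the built engine uses,
# as an alternative to profiling with nsys.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the engine produced by the trtexec command above.
with open("default_int8.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The engine inspector reports one line per layer; a fused MHA shows up as a
# single attention layer instead of separate MatMul/Softmax layers.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.ONELINE))
```

Note that the inspector only reports full per-layer detail when the engine is built with detailed profiling verbosity (e.g. trtexec --profilingVerbosity=detailed); with the default verbosity the output is limited.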

Jetpack info

Package: nvidia-jetpack
Source: nvidia-jetpack (6.1)
Version: 6.1+b123
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 6.1+b123), nvidia-jetpack-dev (= 6.1+b123)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages

steventu27 commented 3 hours ago

I tried --fp16 when building the TensorRT engine and the _gemm_mha_v2 kernel was successfully called. Any idea or documentation on why the MHA kernel does not support INT8?

lix19937 commented 1 hour ago

If you want TensorRT to use the fused MHA kernel, some other conditions also need to be satisfied:

SM version must be >= 75.
The input types of the two batched matrix multiplications must be FP16, INT8 (refer to the following regarding quantize and dequantize layer placement), or BF16.
Head size H must satisfy the constraints 16 <= H <= 256 and H % 8 == 0 for FP16 and BF16.
Head size must be 16, 32, or 64, and sequence lengths (S_q, S_kv) must be <= 512 for INT8.
INT8 fused MHA will be generated only if quantize and dequantize layers are placed before the first batched matrix multiplication, after the softmax, and after the second batched matrix multiplication (see the sketch after this list).
TensorRT may decide not to fuse an MHA graph into a single kernel based on performance evaluation or other constraints.
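To make the Q/DQ placement rule concrete, here is a minimal sketch of the pattern TensorRT looks for when deciding whether to generate INT8 fused MHA. It is not the graph from the linked issue; the tensor names, shapes, and scale values are illustrative assumptions (real scales would come from calibration or QAT), and the 1/sqrt(H) scaling before the softmax is omitted for brevity. The head size (64) and sequence length (128) are chosen to satisfy the INT8 constraints listed above.

```python
# Sketch of the Q/DQ placement that allows INT8 fused MHA:
#   Q/DQ before the first batched MatMul (Q x K^T),
#   Q/DQ after the Softmax,
#   Q/DQ after the second batched MatMul (probs x V).
# Names, shapes, and scales are illustrative assumptions.
import onnx
from onnx import TensorProto, helper

B, N, S, H = 1, 8, 128, 64          # batch, heads, sequence length, head size (H in {16, 32, 64} for INT8)
shape = [B, N, S, H]

def qdq(name, inp, scale):
    """Append a QuantizeLinear/DequantizeLinear pair after tensor `inp`."""
    s = helper.make_tensor(f"{name}_scale", TensorProto.FLOAT, [], [scale])
    z = helper.make_tensor(f"{name}_zp", TensorProto.INT8, [], [0])
    q = helper.make_node("QuantizeLinear", [inp, f"{name}_scale", f"{name}_zp"], [f"{name}_q"])
    dq = helper.make_node("DequantizeLinear", [f"{name}_q", f"{name}_scale", f"{name}_zp"], [f"{name}_dq"])
    return [q, dq], [s, z], f"{name}_dq"

nodes, inits = [], []

# Q/DQ on both inputs of the first batched MatMul
n, i, q_dq = qdq("q", "query", 0.05); nodes += n; inits += i
n, i, k_dq = qdq("k", "key_t", 0.05); nodes += n; inits += i
nodes.append(helper.make_node("MatMul", [q_dq, k_dq], ["qk"]))

nodes.append(helper.make_node("Softmax", ["qk"], ["probs"], axis=-1))

# Q/DQ after the Softmax, plus Q/DQ on the V input of the second batched MatMul
n, i, p_dq = qdq("p", "probs", 1.0 / 127); nodes += n; inits += i
n, i, v_dq = qdq("v", "value", 0.05); nodes += n; inits += i
nodes.append(helper.make_node("MatMul", [p_dq, v_dq], ["attn"]))

# Q/DQ after the second batched MatMul
n, i, out_dq = qdq("o", "attn", 0.05); nodes += n; inits += i

graph = helper.make_graph(
    nodes, "int8_mha_qdq_pattern",
    inputs=[
        helper.make_tensor_value_info("query", TensorProto.FLOAT, shape),
        helper.make_tensor_value_info("key_t", TensorProto.FLOAT, [B, N, H, S]),  # K already transposed
        helper.make_tensor_value_info("value", TensorProto.FLOAT, shape),
    ],
    outputs=[helper.make_tensor_value_info(out_dq, TensorProto.FLOAT, shape)],
    initializer=inits,
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "mha_qdq_pattern.onnx")
```

If TensorRT accepts the pattern, the two batched MatMuls and the Softmax should collapse into a single fused attention layer in the built engine (visible in the engine inspector output or an nsys trace); if any of the Q/DQ pairs is missing or misplaced, the ops stay separate and run as individual kernels.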