steventu27 opened 4 hours ago
I tried --fp16 when building the TensorRT engine, and the kernel _gemm_mha_v2 was successfully called. Any idea/doc about why INT8 is not supported by the MHA kernel?
To use the MHA kernel, some other conditions must also be satisfied:
- SM version must be >= 75.
- The input types of the two batched matrix multiplications must be FP16, INT8 (see below regarding quantize and dequantize layer placement), or BF16.
- For FP16 and BF16, head size H must satisfy 16 <= H <= 256 and H % 8 == 0.
- For INT8, head size must be 16, 32, or 64, and sequence lengths (S_q, S_kv) must be <= 512.
- INT8 fused MHA will be generated only if quantize and dequantize layers are placed before the first batched matrix multiplication, after the softmax, and after the second batched matrix multiplication (a sketch of this placement follows the list).
- TensorRT may still decide not to fuse an MHA graph into a single kernel based on performance evaluation or other constraints.
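A minimal sketch of that Q/DQ placement using the onnx Python helpers; the tensor names, the shapes (B = 8, S = 128, H = 64), and the per-tensor scale of 0.05 are hypothetical stand-ins chosen for illustration, not values taken from the model in this issue:

```python
import onnx
from onnx import TensorProto, helper

B, S, H = 8, 128, 64  # hypothetical shapes; H = 64 satisfies the INT8 head-size rule
scale = helper.make_tensor("scale", TensorProto.FLOAT, [], [0.05])  # assumed calibration scale
zp = helper.make_tensor("zp", TensorProto.INT8, [], [0])

def qdq(prefix, inp):
    """Emit a per-tensor QuantizeLinear/DequantizeLinear pair on tensor `inp`."""
    q = helper.make_node("QuantizeLinear", [inp, "scale", "zp"], [prefix + "_q"])
    dq = helper.make_node("DequantizeLinear", [prefix + "_q", "scale", "zp"], [prefix + "_dq"])
    return [q, dq], prefix + "_dq"

nodes = []
# Q/DQ before the first batched MatMul (Q x K^T), on both of its inputs
n, q_in = qdq("q", "query"); nodes += n
n, k_in = qdq("k", "key_t"); nodes += n
nodes.append(helper.make_node("MatMul", [q_in, k_in], ["scores"]))
nodes.append(helper.make_node("Softmax", ["scores"], ["probs"], axis=-1))
# Q/DQ after the softmax
n, p_in = qdq("p", "probs"); nodes += n
nodes.append(helper.make_node("MatMul", [p_in, "value"], ["ctx"]))
# Q/DQ after the second batched MatMul
n, out = qdq("o", "ctx"); nodes += n

graph = helper.make_graph(
    nodes, "qdq_mha_sketch",
    inputs=[
        helper.make_tensor_value_info("query", TensorProto.FLOAT, [B, S, H]),
        helper.make_tensor_value_info("key_t", TensorProto.FLOAT, [B, H, S]),
        helper.make_tensor_value_info("value", TensorProto.FLOAT, [B, S, H]),
    ],
    outputs=[helper.make_tensor_value_info(out, TensorProto.FLOAT, [B, S, H])],
    initializer=[scale, zp],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "qdq_mha_sketch.onnx")
```

This only mirrors the three placement points listed above; whether TensorRT fuses the resulting graph still depends on its own performance evaluation.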
Description
I use the exact ONNX file attention_ln_opset13.onnx from https://github.com/NVIDIA/TensorRT/issues/3575#issuecomment-1874776406. The attention graph looks like: [attention graph screenshot]
When I use TensorRT 10.3 with JetPack 6.1, the command is

trtexec --onnx=attention_ln_opset13.onnx --int8 --saveEngine=default_int8.engine

and the nsys trace below shows that the MHA kernel is not used. nsys_engine.zip (attached):
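One way to check whether the builder actually produced a fused MHA layer is to inspect the serialized engine with the TensorRT Python API. A short sketch, assuming the engine file from the trtexec command above (layer names are most informative when the engine is built with --profilingVerbosity=detailed):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("default_int8.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# A fused MHA appears as one layer spanning both batched MatMuls and the
# Softmax; if they show up as separate layers, the fusion did not happen.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.ONELINE))
```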
Jetpack info
Package: nvidia-jetpack
Source: nvidia-jetpack (6.1)
Version: 6.1+b123
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 6.1+b123), nvidia-jetpack-dev (= 6.1+b123)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages