NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

--bf16 doesn't work for convolutional layers #3718

Open youki-sada opened 7 months ago

youki-sada commented 7 months ago

Do you have any plan to fix the --bf16 option? It does not affect convolutional layers, which remain in TF32. We succeeded in quantizing to bfloat16 by setting precisionConstraints and layerPrecisions with a wildcard; however, the performance is not the same as --fp16.

Layer precisions from trtexec --onnx=tmp.onnx --bf16, visualized with TREx:

[image]

| option | latency [ms] | FPS |
| --- | --- | --- |
| none (tf32) | 347.6 | 184.1 |
| --fp16 | 255.9 | 250.1 |
| --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw | 151.5 | 422.3 |
| --bf16 | 348.2 | 183.8 |
| --bf16 --precisionConstraints=obey --layerPrecisions=/*:bf16 | 283.4 | 225.8 |
| --bf16 --precisionConstraints=obey --layerPrecisions=/*:bf16 --inputIOFormats=bf16:chw --outputIOFormats=bf16:chw | 177.8 | 360.0 |
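
For reference, a minimal Python sketch of the wildcard workaround above, roughly equivalent to `--bf16 --precisionConstraints=obey --layerPrecisions=/*:bf16`. This is an assumption about how one might do it with the TensorRT 9.x Python API, not code from this issue; the ONNX path is a placeholder for the model in tmp.zip.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("tmp.onnx", "rb") as f:          # placeholder path for the attached model
    assert parser.parse(f.read()), "ONNX parse failed"

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.BF16)                        # allow BF16 kernels
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)  # honor per-layer precision

# Force every layer to BF16, mirroring the --layerPrecisions=/*:bf16 wildcard.
for i in range(network.num_layers):
    network.get_layer(i).precision = trt.DataType.BF16

engine = builder.build_serialized_network(network, config)
```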

Related issue: #3583

Environment

TensorRT Version: TensorRT OSS v9.3.0
NVIDIA GPU: RTX 4090
NVIDIA Driver Version: 535.154.05
CUDA Version: 12.2

Operating System: Ubuntu 22.04 (Docker)

Relevant Files

tmp.zip

zerollzeng commented 7 months ago

@nvpohanh ^ ^

nvpohanh commented 7 months ago

Is there a reason why your application could not run in FP16? We would like to understand whether there are any ConvNet examples where BF16 must be used, to help us decide the priority of BF16 conv perf optimizations. Thanks!

youki-sada commented 7 months ago

We are working on efficient vision transformer models, which use convolutions for the earlier layers and multi-head attention for the later ones. For the multi-head attention, we need BF16 or FP32 to maintain accuracy. Thus, in our case, the easiest approach would be to quantize all layers with BF16, including the convolutional layers.

nvpohanh commented 7 months ago

I see, so it is a network with convs + transformers, right? I will bring this feedback up internally for discussion.

For MHA (multi-head attention) + FP16, is it because it runs into an overflow issue? If so, we can also try FP16 MHA with FP32 accumulation by:

Q -> Cast(toFP32) -> MatMul -> Cast(toFP16) -> Softmax -> Cast(toFP32) -> MatMul -> Cast(toFP16) -> ...
K -> Cast(toFP32) ----^
                                                     V -> Cast(toFP32) ----^
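
For illustration, here is a minimal PyTorch sketch of that cast pattern (an assumption about how one might express it before ONNX export, not code from this thread). The explicit .float()/.half() calls become the Cast nodes shown above in the exported graph; shapes and scaling are illustrative.

```python
import torch

def mha_fp32_accum(q, k, v, scale):
    """FP16 inputs; both MatMuls accumulate in FP32, Softmax stays in FP16."""
    # Q/K MatMul in FP32 (Cast(toFP32) -> MatMul), then back to FP16
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2)) * scale
    attn = torch.softmax(scores.half(), dim=-1)   # Softmax in FP16
    # Attention/V MatMul in FP32, then back to FP16
    out = torch.matmul(attn.float(), v.float())
    return out.half()

# Example shapes (batch, heads, seq, head_dim), all FP16
q = torch.randn(1, 8, 196, 64, dtype=torch.float16)
k = torch.randn(1, 8, 196, 64, dtype=torch.float16)
v = torch.randn(1, 8, 196, 64, dtype=torch.float16)
out = mha_fp32_accum(q, k, v, scale=64 ** -0.5)
```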
youki-sada commented 7 months ago

Yes, it consists of both convs and transformers. We found that FP16+BF16 mixed precision solved the accuracy degradation, but it would be much simpler if we could use BF16 for all layers via --bf16.

> I will bring this feedback internally for discussion.

I appreciate it. In detail, the accuracy degradation is caused by overflow and underflow. Some networks (e.g. EfficientViT) adopt linear attention, which divides the MHA output by its last channel. This division is usually fused into the next pointwise layer in TRT, and a division by zero occurs due to FP16 underflow.
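
As a small illustration of that underflow (the magnitude is an assumption for demonstration, not a value from the model): a denominator around 1e-8 flushes to zero in FP16 but remains representable in BF16, since BF16 keeps the FP32 exponent range.

```python
import torch

d = torch.tensor(1e-8)      # small positive denominator, stored as FP32
print(d.half())             # ~0.0 in FP16 (below the smallest FP16 subnormal)
print(d.bfloat16())         # ~1e-8 in BF16 (FP32 exponent range, reduced mantissa)

# The division the fused pointwise layer performs then blows up in FP16:
print(torch.tensor(1.0).half() / d.half())   # inf (division by zero)
```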

lix19937 commented 1 week ago

Try using the latest version of TRT.