NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Set layer precision still doesn't take effect in TensorRT 8.6.1. #3224

Open YouSenRong opened 1 year ago

YouSenRong commented 1 year ago

Description

As I reported earlier in "Skipping tactic 0x0000000000000000 due to Myelin error" degrades performance, setting layer precision could fail in TensorRT 8.4.3 because of the ConstShuffleFusion.

These days I have been trying TensorRT 8.6.1, but setting layer precision still seems to fail due to the ConstShuffleFusion. For example, as shown in the graph below, the Max op takes a constant input named "phase0_tf/predict_node/y:0", whose value appears to be an FP16 subnormal, so I used the set_precision API to set the layer ("phase0_tf/predict_node/y:0") to FP32 explicitly.

[screenshot: network graph around the Max op]
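As a side note, whether a constant falls into FP16's subnormal range can be checked with a small standalone sketch (the threshold 2**-14 is FP16's smallest normal magnitude; the value 1e-5 below is a hypothetical example, not the actual constant from the model):

```python
import numpy as np

def is_fp16_subnormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than FP16's
    smallest normal value (2**-14 ~= 6.1035e-05), i.e. it becomes
    subnormal (with reduced precision) when stored as float16."""
    fp16_min_normal = 2.0 ** -14
    return x != 0.0 and abs(x) < fp16_min_normal

# A constant like 1e-5 is exactly representable enough in FP32,
# but subnormal in FP16, so it loses precision when cast down.
print(is_fp16_subnormal(1e-5))   # True
print(is_fp16_subnormal(0.25))   # False
print(np.float16(1e-5))          # rounded to the nearest FP16 subnormal
```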

The verbose logs are as follows:

[screenshot: verbose build logs]

When the FP16 subnormal constant is not set to FP32, the logs show that the layer "phase0_tf/predict_node/y:0 + (Unnamed Layer* 522) [Shuffle]" runs in FP16 precision: [screenshots]

However, when the FP16 subnormal constant is set to FP32, the layer "phase0_tf/predict_node/y:0 + (Unnamed Layer* 522) [Shuffle]" still runs in FP16: [screenshot]

By the way, the ConstShuffleFusion produces two kinds of layers, such as: [screenshots]


I am confused about the difference between them. Is that the reason set_precision fails for the layer "phase0_tf/predict_node/y:0"?

Looking forward to your reply. Thanks a lot!

Environment

TensorRT Version: 8.6.1

NVIDIA GPU: T4

NVIDIA Driver Version: 510

CUDA Version: 12.0

CUDNN Version:

Operating System: Ubuntu 20.04

Python Version (if applicable):

Tensorflow Version (if applicable): 1.4

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

zerollzeng commented 1 year ago

How do you set the layer precision? Did you set the precision constraint to obey? See https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#abdc74c40fe7a0c3d05d2caeccfbc29c1

YouSenRong commented 1 year ago

Thanks for your reply! @zerollzeng

How do you set the layer precision?

I set the precision by calling setPrecision on the layer, as shown: [screenshot]

Did you set the precision constraint to obey?

Yes, I set BuilderFlag::kOBEY_PRECISION_CONSTRAINTS, as shown: [screenshot] However, it still doesn't work.

For the other layers, setPrecision works. Only setPrecision on the "phase0_tf/predict_node/y:0" layer doesn't take effect.

zerollzeng commented 1 year ago

Could you please provide a reproduction for us? Thanks!

I would prefer an onnx model that can reproduce this error.

YouSenRong commented 1 year ago

Sorry for the late response. @zerollzeng I have split out a subgraph of the model (subgraph.onnx.zip), but I can't reproduce the error on the subgraph; I can only reproduce it on the full model. I ran both the subgraph and the full model with trtexec from TensorRT 8.6 using these commands:

```
./trtexec --onnx=subgraph.onnx --fp16 --verbose --builderOptimizationLevel=3 --layerPrecisions="phase0_tf/predict_node/y:0:fp32,phase0_tf/predict_node:fp32" --layerOutputTypes="phase0_tf/predict_node/y:0:fp32" --precisionConstraints="obey" > subgraph.log 2>&1
./trtexec --onnx=full_model.onnx --fp16 --verbose --builderOptimizationLevel=3 --layerPrecisions="phase0_tf/predict_node/y:0:fp32,phase0_tf/predict_node:fp32" --layerOutputTypes="phase0_tf/predict_node/y:0:fp32" --precisionConstraints="obey" > full_model.log 2>&1
```

The logs are shown as follows: [screenshot] It seems that the tactics used differ between the subgraph and the full model.

Besides, I had set "phase0_tf/predict_node/y:0" and "phase0_tf/predict_node" to FP32, but the warning message still shows that the layer "phase0_tf/predict_node/y:0 + (Unnamed Layer* 522) [Shuffle]" has an FP16 subnormal value: [screenshots]

For the full model, I may have to ask for permission to share it. Alternatively, can I send the full model to you privately instead of posting it publicly on GitHub?

zerollzeng commented 1 year ago

@nvpohanh On the right part of the image it's a Myelin subgraph; is it possible that Myelin already set the precision to FP32 but just didn't print it in the log? [screenshot]

YouSenRong commented 1 year ago

is it possible that myelin already set the precision to FP32 but just didn't print it in the log?

That can't account for the large difference between pure FP32 and mixed FP32+FP16.

nvpohanh commented 1 year ago

Several things I would try:

  1. Set the precision of the Concat op before the Max op to FP32, and also set the Concat's output dtype (using set_output_type()) to FP32.
  2. If that doesn't work, add a "Cast" op that casts the Concat's output to FP32 before feeding it into Max.
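The effect that the Cast-before-Max workaround guards against can be sketched with numpy (the constant and activation values below are hypothetical, chosen only to be subnormal in FP16, not taken from the actual model):

```python
import numpy as np

# Hypothetical constant in the spirit of "phase0_tf/predict_node/y:0":
# small enough to be subnormal in FP16 (below 2**-14 ~= 6.1e-05).
threshold = np.float32(1e-5)
x = np.float32(2e-6)  # a hypothetical activation value below the threshold

# FP16 path: both operands are rounded to float16 before the Max,
# so the subnormal constant loses precision.
out_fp16 = np.maximum(np.float16(x), np.float16(threshold))

# FP32 path: what keeping (or casting) the inputs in FP32 preserves.
out_fp32 = np.maximum(x, threshold)

# The two results differ because the FP16 subnormal rounding already
# perturbed the constant before the Max was computed.
print(float(out_fp16), float(out_fp32))
```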

On the right part of the image, it's a myelin subgraph, is it possible that myelin already set the precision to FP32 but just didn't print it in the log?

If the ForeignNode optimization is triggered, we do not have information about the detailed dtype info. We will need to use Nsys to look at it (or use --dumpLayerInfo --profilingVerbosity=detailed with latest TRT internal build).

I think the first thing we should do is to repro the accuracy difference between pure-FP32 and FP32+FP16.

YouSenRong commented 1 year ago

@nvpohanh Do you need the full model to reproduce the error?

nvpohanh commented 1 year ago

We probably don't need the full model, but we do need a way to repro the "large difference between pure FP32 and mixed FP32&FP16" you mentioned.

YouSenRong commented 1 year ago

Based on TensorRT 8.6, the diff is as follows:

absolute difference: min: 9.02219e-10 (0.000139833, 0.000139832), max: 0.00138001 (0.0436334, 0.0450134), mean: 9.89399e-06
relative difference: min: 5.52027e-06 (0.0022354, 0.00223541), max: 0.141119 (8.68643e-05, 9.91225e-05), mean: 0.00445263

The max relative difference is about 0.14. Based on TensorRT 8.4.3, the max relative difference between FP32 and mixed FP32+FP16 is only about 0.01.

For the repro, I will try to save the input data.

zerollzeng commented 1 year ago

A similar issue: https://github.com/NVIDIA/TensorRT/issues/3257

ttyio commented 1 year ago

@zerollzeng is this dup of #3257 ? thanks

zerollzeng commented 1 year ago

@zerollzeng is this dup of #3257 ? thanks

Maybe not.

absolute difference: min: 9.02219e-10 (0.000139833, 0.000139832), max: 0.00138001 (0.0436334, 0.0450134), mean: 9.89399e-06
relative difference: min: 5.52027e-06 (0.0022354, 0.00223541), max: 0.141119 (8.68643e-05, 9.91225e-05), mean: 0.00445263
The max relative difference is about 0.14. Based on TensorRT 8.4.3, the max relative difference between FP32 and mixed FP32+FP16 is only about 0.01.

The diff doesn't look very big in either case. What is the output data range?

YouSenRong commented 1 year ago

what is the out data range?

What does the output data range mean? I have tried both the enqueueV2 and enqueueV3 APIs, but all the results have big diffs. I am organizing the details.

zerollzeng commented 1 year ago

The output data range. E.g., if the range is [-1, 1], then the diff (max 0.001) looks fine to me.

YouSenRong commented 1 year ago

The data range is [0, 1]. But the relative difference is too big, and I think it is caused by setting the layer precision to FP32 not taking effect. Besides, the max is not always 0.001; sometimes it is bigger. In cases where setting the layer precision takes effect, the diff is small; in cases where it doesn't, the diff is big.

zerollzeng commented 12 months ago

But the relative difference is too big

If the TRT output has the value 0.000001 and the ONNX output has the value 0.000002, then you will see a relative difference of 1. Have you tried Po-Han's suggestion to set the layer precision?
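To make that concrete, here is a small numpy sketch (the array values are hypothetical, echoing the magnitudes quoted in this thread rather than actual model outputs) showing how tiny magnitudes inflate relative difference while larger ones don't:

```python
import numpy as np

# Hypothetical outputs: one tiny-magnitude element, one larger one.
trt_out = np.array([1e-6, 0.0436334], dtype=np.float32)
ref_out = np.array([2e-6, 0.0450134], dtype=np.float32)

abs_diff = np.abs(trt_out - ref_out)
# Relative to the TRT output, matching the 0.000001-vs-0.000002 example.
rel_diff = abs_diff / np.abs(trt_out)

# Tiny magnitudes: abs diff ~1e-6 but relative diff 1.0.
# Larger magnitudes: abs diff ~1.4e-3 but relative diff only ~3%.
print(abs_diff)
print(rel_diff)
```

So a large max relative difference by itself is not conclusive; it has to be read together with the absolute difference and the output range.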

YouSenRong commented 12 months ago

If the TRT output has the value 0.000001 and the ONNX output has the value 0.000002, then you will see a relative difference of 1.

Yes, I understand, but the absolute diff is not that small either.

Have you tried Po-Han's suggestion to set the layer precision?

Yes, I set the layer precision as Po-Han suggested, but it still doesn't take effect.

zerollzeng commented 12 months ago

Okay, I think we need a reproduction to debug this issue further.

YouSenRong commented 11 months ago

Taking the result of TensorFlow (in FP32) as the reference and comparing against TRT 8.4, TRT 8.6, and TRT 9.1 (in FP16), 10 samples are as follows:

TF(FP32) vs TRT8.4(FP16) diff_tf_trt8.4.txt

TF(FP32) vs TRT8.6(FP16) diff_tf_trt8.6.txt

TF(FP32) vs TRT9.1(FP16) diff_tf_trt9.1.txt

These data show that the diffs for TRT 8.6 and TRT 9.1 are bigger.

Besides, with set_precision, the diff between TF (FP32) and TRT 8.4 (FP16) can be reduced: diff_tf_trt8.4-set_precision.txt But set_precision doesn't take effect in TRT 8.6 and TRT 9.1.