NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

FP16 Accuracy failure of TensorRT 8.6.3 when running trtexec built engine on GPU RTX4090 #3893

Open · bernardrb opened this issue 3 months ago

bernardrb commented 3 months ago

Description

We are trying to recreate the results from: https://arxiv.org/abs/2402.05008.

Using the .onnx file provided by the authors to compile engines, we find that the fp16 engine has 0% accuracy while the fp32 engine has the expected accuracy. This worked for us previously, but now compiling the model in fp16 does not.

The measurements are mIoU scores for objects of different sizes.

L2 - FP16 {"all": 0.0, "large": 0.0, "medium": 0.0, "small": 0.0}

L2 - FP32 {"all": 79.12385607181146, "large": 83.05853600575689, "medium": 81.50597370444349, "small": 74.8830670481846}

Environment


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 40%   32C    P8              6W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.02-py3

Relevant Files

engine inspection: l2__fp16_inspect.txt l2_fp32_inspect.txt

build log: l2_fp32.log l2_fp16.log

onnx link: https://drive.google.com/drive/folders/1Yt8xDfdkmL6W-IO-KhUhR_J-_2ion-v5?usp=sharing

brb-nv commented 3 months ago

Hi, thank you for the detailed information.

> This was previously possible for us, but now compiling the model in fp16 is not working.

Can you please let me know what you mean by this? Was there an older version of TRT with which you'd get reasonable FP16 results? But, not with the current version you're using?

Fwiw, I've tried running your model with Polygraphy using onnxrt and trt as backends, and the results had some discrepancy:

    polygraphy run random/github_issues/3893/l2_encoder.onnx --onnxrt --trt --verbose

Will need to investigate further.

bernardrb commented 3 months ago

Thanks for the quick response.

> Was there an older version of TRT with which you'd get reasonable FP16 results? But, not with the current version you're using?

To clarify, we are using the same version of TRT that is provided in the nvcr.io/nvidia/tensorrt:24.02-py3 container, both before and now. Perhaps this information is more confusing than enlightening. It may be better to focus on the reproducible error that we have now: the discrepancy between FP32 and FP16. How do we go about isolating this potential problem further?

This is how we built the engine; for the fp16 model we added --fp16 (spelled out below).

   trtexec --onnx=l2_encoder.onnx \
        --minShapes=input_image:1x3x512x512 \
        --optShapes=input_image:1x3x512x512 \
        --maxShapes=input_image:4x3x512x512 \
        --saveEngine=l2_encoder.engine
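
For clarity, the fp16 build is the same command with --fp16 added (the engine filename here is just illustrative):

   trtexec --onnx=l2_encoder.onnx \
        --minShapes=input_image:1x3x512x512 \
        --optShapes=input_image:1x3x512x512 \
        --maxShapes=input_image:4x3x512x512 \
        --fp16 \
        --saveEngine=l2_encoder_fp16.engine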
lix19937 commented 3 months ago

@bernardrb

 polygraphy run l2_encoder.onnx --trt --onnxrt  --fp16 \
     --trt-outputs mark all \
     --onnx-outputs mark all

to see which layer diverges first; it may be an fp16 overflow.

bernardrb commented 3 months ago

Here is the polygraphy log. There were other processes running on the GPU in parallel, which caused the long inference latencies.

polygraphy.log

Snippet that includes the pass rate and a comparison of the network output:

[E] Accuracy Summary | trt-runner-N0-05/25/24-11:45:48 vs. onnxrt-runner-N0-05/25/24-11:45:48 | Passed: 0/1 iterations | Pass Rate: 0.0%

Comparing Output: 'image_embeddings' (dtype=float32, shape=(1, 256, 64, 64)) with 'image_embeddings' (dtype=float32, shape=(1, 256, 64, 64))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-05/25/24-11:45:48: image_embeddings | Stats: mean=-0.0042355, std-dev=0.092963, var=0.0086421, median=-0.0030975, min=-0.46509 at (0, 104, 60, 56), max=0.55225 at (0, 73, 63, 0), avg-magnitude=0.064991
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.517 , -0.402 ) |        105 | 
                (-0.402 , -0.288 ) |      10035 | 
                (-0.288 , -0.173 ) |      35547 | ##
                (-0.173 , -0.0587) |     175281 | ###########
                (-0.0587, 0.0558 ) |     613651 | ########################################
                (0.0558 , 0.17   ) |     178516 | ###########
                (0.17   , 0.285  ) |      26618 | #
                (0.285  , 0.399  ) |       8333 | 
                (0.399  , 0.514  ) |        487 | 
                (0.514  , 0.628  ) |          3 | 
[I]         onnxrt-runner-N0-05/25/24-11:45:48: image_embeddings | Stats: mean=-0.0047792, std-dev=0.093688, var=0.0087775, median=-0.0037169, min=-0.51662 at (0, 184, 5, 38), max=0.62824 at (0, 73, 0, 0), avg-magnitude=0.065327
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.517 , -0.402 ) |       1321 | 
                (-0.402 , -0.288 ) |       7299 | 
                (-0.288 , -0.173 ) |      38876 | ##
                (-0.173 , -0.0587) |     173984 | ###########
                (-0.0587, 0.0558 ) |     618909 | ########################################
                (0.0558 , 0.17   ) |     175937 | ###########
                (0.17   , 0.285  ) |      22389 | #
                (0.285  , 0.399  ) |       9523 | 
                (0.399  , 0.514  ) |        330 | 
                (0.514  , 0.628  ) |          8 | 
[I]         Error Metrics: image_embeddings
[I]             Minimum Required Tolerance: elemwise error | [abs=0.15575] OR [rel=7.8643e+05] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.015861, std-dev=0.015567, var=0.00024232, median=0.011677, min=0 at (0, 160, 33, 55), max=0.15575 at (0, 73, 28, 0), avg-magnitude=0.015861
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0     , 0.0156) |     637054 | ########################################
                    (0.0156, 0.0311) |     265806 | ################
                    (0.0311, 0.0467) |      92225 | #####
                    (0.0467, 0.0623) |      33242 | ##
                    (0.0623, 0.0779) |      14037 | 
                    (0.0779, 0.0934) |       4728 | 
                    (0.0934, 0.109 ) |       1167 | 
                    (0.109 , 0.125 ) |        257 | 
                    (0.125 , 0.14  ) |         51 | 
                    (0.14  , 0.156 ) |          9 | 
[I]             Relative Difference | Stats: mean=2.6304, std-dev=780.83, var=6.097e+05, median=0.23168, min=0 at (0, 160, 33, 55), max=7.8643e+05 at (0, 74, 12, 18), avg-magnitude=2.6304
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 7.86e+04) |    1048574 | ########################################
                    (7.86e+04, 1.57e+05) |          1 | 
                    (1.57e+05, 2.36e+05) |          0 | 
                    (2.36e+05, 3.15e+05) |          0 | 
                    (3.15e+05, 3.93e+05) |          0 | 
                    (3.93e+05, 4.72e+05) |          0 | 
                    (4.72e+05, 5.51e+05) |          0 | 
                    (5.51e+05, 6.29e+05) |          0 | 
                    (6.29e+05, 7.08e+05) |          0 | 
                    (7.08e+05, 7.86e+05) |          1 | 
[E]         FAILED | Output: 'image_embeddings' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Regarding the tolerance threshold, is it too strict? All layers fail.

What conclusion can we draw from this?

brb-nv commented 3 months ago

I think Polygraphy is mainly meant for accuracy debugging, not latency debugging. trtexec is a better tool for figuring out latency issues.

Here are some insights from a colleague who specializes in quantization:

Generally, for FP16 we expect a 1e-3 absolute OR relative difference. However, in the log snippet above I'm not seeing --fp16 or --stronglyTyped, which means this model runs in fp32 only.

Polygraphy also has some useful options for debugging TensorRT accuracy.
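
For example, a sketch that reuses the flags already in this thread but with the FP16-level tolerances mentioned above instead of the 1e-5 defaults:

    polygraphy run l2_encoder.onnx --trt --fp16 --onnxrt \
        --atol 1e-3 --rtol 1e-3 \
        --trt-outputs mark all --onnx-outputs mark all

With mark all on both sides, the first intermediate tensor that exceeds the tolerance points at the layer where FP16 starts to diverge.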

bernardrb commented 3 months ago

> However, in the log snippet above I'm not seeing --fp16 or --stronglyTyped, which means this model runs in fp32 only.

What log file are you referring to? If it is polygraphy.log in 3893#issuecomment-2131240255, then the --fp16 flag is set (first line):

[I] RUNNING | Command: /usr/local/bin/polygraphy run assets/onnx/l2_encoder.onnx --trt --onnxrt --fp16 --trt-outputs mark all --onnx-outputs mark all

> I think Polygraphy is mainly meant for accuracy debugging, not latency debugging. trtexec is a better tool for figuring out latency issues.

I agree, sorry for the confusing statement.

Here are the results for Auto-reduce model and Debug precision:

I think I found the smallest subgraph, as seen below. Debug precision failed. Final reduced onnx: https://drive.google.com/drive/folders/1Yt8xDfdkmL6W-IO-KhUhR_J-_2ion-v5?usp=sharing

Having analyzed the network graph, what is next? Is the solution likely to set precision constraints on the failing nodes (see the sketch after the logs at the end of this comment)? Are there other solutions such that one can still run the model in FP16?

Auto-reduce model

1.


$ polygraphy surgeon sanitize l2_encoder.onnx -o folded.onnx --fold-constants \
    --override-input-shapes x0:[1,3,224,224] x1:[1,3,224,224]

2.

$ polygraphy run folded.onnx --onnxrt \
    --save-inputs inputs.json \
    --onnx-outputs mark all --save-outputs layerwise_golden.json

$ polygraphy data to-input inputs.json layerwise_golden.json -o layerwise_inputs.json

3.

$ polygraphy debug reduce folded.onnx -o initial_reduced.onnx --mode=bisect --load-inputs layerwise_inputs.json \
    --check polygraphy run polygraphy_debug.onnx --trt --rtol 1e-3 --atol 1e-3 \
            --load-inputs layerwise_inputs.json --load-outputs layerwise_golden.json

[I] Finished 10 iteration(s) | Passed: 3/10 | Pass Rate: 30.0%
[I] Finished reducing model outputs
[I] Marking model outputs: [Variable (/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0): (shape=[1, 32, 256, 256], dtype=float32)]
[I] Reducing model inputs
[I]     RUNNING | Iteration 1 | Approximately 4 iteration(s) remaining
[I]         Marking model inputs: [Variable (/image_encoder/backbone/stages.0/op_list.0/act/Add_output_0): (shape=[1, 32, 256, 256], dtype=float32)]
[I]         Running inference with ONNX-Runtime to determine metadata for intermediate tensors.
            This will cause intermediate models to have static shapes.
[I]             Running fallback shape inference using input metadata:
                {input_image [dtype=float32, shape=(1, 3, 512, 512)]}
[I]         Freezing tensor: Variable (input_image): (shape=[1, 3, 512, 512], dtype=float32) to eliminate branches.
[W] It looks like this model contains foldable nodes that produce large outputs.
In order to avoid bloating the model, you may want to set a constant-folding size threshold.
Note: Large tensors and their corresponding sizes were: {'/image_encoder/backbone/stages.0/op_list.0/conv/Conv_output_0': '8 MiB'}
[I]         Saving ONNX model to: polygraphy_debug.onnx
[I] RUNNING | Command: /usr/local/bin/polygraphy run polygraphy_debug.onnx --trt --fp16 --rtol 1e-3 --atol 1e-3 --load-inputs layerwise_inputs.json --load-outputs layerwise_golden.json

[I] Finished 3 iteration(s) | Passed: 1/3 | Pass Rate: 33.333333333333336%
[I] Finished reducing model inputs
[I] Marking model inputs: [Variable (/image_encoder/backbone/stages.0/op_list.0/act/Tanh_output_0): (shape=[1, 32, 256, 256], dtype=float32)]
[I] Freezing tensor: Variable (input_image): (shape=[1, 3, 512, 512], dtype=float32) to eliminate branches.
[I] Minimum Bad Model:
    Name: torch_jit | ONNX Opset: 17

    ---- 1 Graph Input(s) ----
    {/image_encoder/backbone/stages.0/op_list.0/act/Tanh_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 1 Graph Output(s) ----
    {/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 5 Initializer(s) ----

    ---- 4 Node(s) ----

[I] Saving ONNX model to: initial_reduced.onnx
[I] PASSED | Runtime: 1438.387s | Command: /usr/local/bin/polygraphy debug reduce folded.onnx -o initial_reduced.onnx --mode=bisect --load-inputs layerwise_inputs.json --check polygraphy run polygraphy_debug.onnx --trt --fp16 --rtol 1e-3 --atol 1e-3 --load-inputs layerwise_inputs.json --load-outputs layerwise_golden.json

4.


$ polygraphy inspect model initial_reduced.onnx --show layers
[I] Loading model: /app/assets/onnx/initial_reduced.onnx
[I] ==== ONNX Model ====
    Name: torch_jit | ONNX Opset: 17

    ---- 1 Graph Input(s) ----
    {/image_encoder/backbone/stages.0/op_list.0/act/Tanh_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 1 Graph Output(s) ----
    {/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 5 Initializer(s) ----
    {/image_encoder/backbone/stages.0/op_list.0/act/Constant_2_output_0 [dtype=float32, shape=()],
     /image_encoder/backbone/stages.0/op_list.0/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)],
     /image_encoder/backbone/stages.0/op_list.0/act/Constant_3_output_0 [dtype=float32, shape=()],
     onnx::Conv_2378 [dtype=float32, shape=(32, 32, 3, 3)],
     onnx::Conv_2379 [dtype=float32, shape=(32,)]}

    ---- 4 Node(s) ----
    Node 0    | /image_encoder/backbone/stages.0/op_list.0/act/Add_1 [Op: Add]
        {Initializer | /image_encoder/backbone/stages.0/op_list.0/act/Constant_2_output_0 [dtype=float32, shape=()],
         /image_encoder/backbone/stages.0/op_list.0/act/Tanh_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}
         -> {/image_encoder/backbone/stages.0/op_list.0/act/Add_1_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    Node 1    | /image_encoder/backbone/stages.0/op_list.0/act/Mul_4 [Op: Mul]
        {Initializer | /image_encoder/backbone/stages.0/op_list.0/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)],
         /image_encoder/backbone/stages.0/op_list.0/act/Add_1_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}
         -> {/image_encoder/backbone/stages.0/op_list.0/act/Mul_4_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    Node 2    | /image_encoder/backbone/stages.0/op_list.0/act/Mul_5 [Op: Mul]
        {Initializer | /image_encoder/backbone/stages.0/op_list.0/act/Constant_3_output_0 [dtype=float32, shape=()],
         /image_encoder/backbone/stages.0/op_list.0/act/Mul_4_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}
         -> {/image_encoder/backbone/stages.0/op_list.0/act/Mul_5_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    Node 3    | /image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv [Op: Conv]
        {/image_encoder/backbone/stages.0/op_list.0/act/Mul_5_output_0 [dtype=float32, shape=(1, 32, 256, 256)],
         Initializer | onnx::Conv_2378 [dtype=float32, shape=(32, 32, 3, 3)],
         Initializer | onnx::Conv_2379 [dtype=float32, shape=(32,)]}
         -> {/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

5.


$ polygraphy debug reduce initial_reduced.onnx -o final_reduced.onnx --mode=linear \
    --load-inputs layerwise_inputs.json \
    --check polygraphy run polygraphy_debug.onnx --trt --fp16 --rtol 1e-3 --atol 1e-3 \
            --load-inputs layerwise_inputs.json --load-outputs layerwise_golden.json

6.


$ polygraphy inspect model final_reduced.onnx --show layers
[I] Loading model: /app/assets/onnx/final_reduced.onnx
[I] ==== ONNX Model ====
    Name: torch_jit | ONNX Opset: 17

    ---- 1 Graph Input(s) ----
    {/image_encoder/backbone/stages.0/op_list.0/act/Mul_4_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 1 Graph Output(s) ----
    {/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    ---- 3 Initializer(s) ----
    {/image_encoder/backbone/stages.0/op_list.0/act/Constant_3_output_0 [dtype=float32, shape=()],
     onnx::Conv_2378 [dtype=float32, shape=(32, 32, 3, 3)],
     onnx::Conv_2379 [dtype=float32, shape=(32,)]}

    ---- 2 Node(s) ----
    Node 0    | /image_encoder/backbone/stages.0/op_list.0/act/Mul_5 [Op: Mul]
        {Initializer | /image_encoder/backbone/stages.0/op_list.0/act/Constant_3_output_0 [dtype=float32, shape=()],
         /image_encoder/backbone/stages.0/op_list.0/act/Mul_4_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}
         -> {/image_encoder/backbone/stages.0/op_list.0/act/Mul_5_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

    Node 1    | /image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv [Op: Conv]
        {/image_encoder/backbone/stages.0/op_list.0/act/Mul_5_output_0 [dtype=float32, shape=(1, 32, 256, 256)],
         Initializer | onnx::Conv_2378 [dtype=float32, shape=(32, 32, 3, 3)],
         Initializer | onnx::Conv_2379 [dtype=float32, shape=(32,)]}
         -> {/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv_output_0 [dtype=float32, shape=(1, 32, 256, 256)]}

Debug precision

polygraphy debug precision --fp16 l2_encoder.onnx --precision float32 --check polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy debug precision --fp16 l2_encoder.onnx --precision float32 --check polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
[W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[I] Using DataType.FLOAT as higher precision
[I]     RUNNING | Iteration 1 | Approximately 12 iteration(s) remaining
[I]     Selecting first 2576 layer(s) to run in higher precision
[W]     Input tensor: input_image (dtype=DataType.FLOAT, shape=(-1, 3, 512, 512)) | No shapes provided; Will use shape: [1, 3, 512, 512] for min/opt/max in profile.
[W]     This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[I]     Configuring with profiles:[
            Profile 0:
                {input_image [min=[1, 3, 512, 512], opt=[1, 3, 512, 512], max=[1, 3, 512, 512]]}
        ]
[I]     Building engine with configuration:
        Flags                  | [FP16, OBEY_PRECISION_CONSTRAINTS]
        Engine Capability      | EngineCapability.DEFAULT
        Memory Pools           | [WORKSPACE: 24259.69 MiB, TACTIC_DRAM: 24259.69 MiB]
        Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
        Profiling Verbosity    | ProfilingVerbosity.DETAILED
        Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[W]     Detected layernorm nodes in FP16: , , , , , , ,
[W]     Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[W]     TensorRT encountered issues when converting weights between types and that could affect accuracy.
[W]     If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[W]     Check verbose logs for the list of affected weights.
[W]     - 3 weights are affected by this issue: Detected subnormal FP16 values.
[I]     Finished engine building in 107.220 seconds
[I]     Running check command: polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
[I]     ========== CAPTURED STDOUT ==========
        [I] RUNNING | Command: /usr/local/bin/polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
        [E] FAILED | Runtime: 0.022s | Command: /usr/local/bin/polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
[E]     ========== CAPTURED STDERR ==========
        [!] Model type: engine could not be converted to an ONNX model.
[I]     Saving debug replay to polygraphy_debug_replay.json
[E]     FAILED | Iteration 1 | Duration 108.50705647468567s
[E]     Could not find a configuration that satisfied accuracy requirements.
[I] Finished 1 iteration(s) | Passed: 0/1 | Pass Rate: 0.0%
[I] PASSED | Runtime: 114.194s | Command: /usr/local/bin/polygraphy debug precision --fp16 l2_encoder.onnx --precision float32 --check polygraphy run polygraphy_debug.engine --fp16 --trt --onnxrt
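
If setting precision constraints is indeed the way forward, here is the kind of command we would try (a sketch, not a verified fix): trtexec can pin the nodes isolated by the reduction to FP32. The layer names below are taken from the polygraphy inspect output of the reduced model and may need adjusting to the names TensorRT actually reports; forcing the layernorm layers to FP32, as the build warning above suggests, would follow the same pattern.

    trtexec --onnx=l2_encoder.onnx \
        --minShapes=input_image:1x3x512x512 \
        --optShapes=input_image:1x3x512x512 \
        --maxShapes=input_image:4x3x512x512 \
        --fp16 \
        --precisionConstraints=obey \
        --layerPrecisions="/image_encoder/backbone/stages.0/op_list.0/act/Mul_5":fp32,"/image_encoder/backbone/stages.0/op_list.1/main/conv1/conv/Conv":fp32 \
        --saveEngine=l2_encoder_fp16_constrained.engine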