NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

mAP drops a lot when inferring an INT8-quantized ONNX model #2237

Closed: yunyaoXYY closed this issue 1 year ago

yunyaoXYY commented 2 years ago

Description

Hi, I have a quantized YOLOv5s ONNX model. When I run it with ONNX Runtime, I get an mAP of 36.8, but when I run it with the C++ TensorRT backend with INT8 inference enabled, the mAP drops by 10.9. I'm not sure what the problem is; could you please give some advice and take a look at the attached model? Thanks!
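For reference, a minimal sketch of the ONNX Runtime side of the comparison (the input name x2paddle_images and the shapes are taken from the build log further down; the random input is only a smoke test, not the COCO evaluation that produced the 36.8 mAP):

```python
import numpy as np
import onnxruntime as ort

# Load the quantized model with ONNX Runtime (CPU is enough for a smoke test).
sess = ort.InferenceSession("yolov5s_quant.onnx",
                            providers=["CPUExecutionProvider"])

# Feed a dummy image; a real mAP run would loop over preprocessed COCO images.
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run(None, {"x2paddle_images": dummy})

print(outputs[0].shape)  # expected: (1, 25200, 85), per the polygraphy log below
```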

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: Tesla P40
NVIDIA Driver Version: 510.47.03
CUDA Version: 11.2
CUDNN Version: 8.1.1
Operating System: Linux
Python Version (if applicable): 3.7
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Model link: https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant.onnx
Attached file: yolov5s_quant.onnx.zip

Steps To Reproduce

csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(91)::CheckDynamicShapeConfig The loaded model's input tensor:x2paddle_images has shape [1, 3, 640, 640].
[08/10/2022-09:56:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +196, GPU +0, now: CPU 239, GPU 1274 (MiB)
[08/10/2022-09:56:40] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +6, GPU +2, now: CPU 262, GPU 1276 (MiB)
[08/10/2022-09:56:40] [W] [TRT] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(430)::CreateTrtEngine Start to building TensorRT Engine...
[08/10/2022-09:56:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +184, GPU +76, now: CPU 506, GPU 1352 (MiB)
[08/10/2022-09:56:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +128, GPU +80, now: CPU 634, GPU 1432 (MiB)
[08/10/2022-09:56:55] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.1.1
[08/10/2022-09:56:55] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/10/2022-09:58:20] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[08/10/2022-09:58:20] [I] [TRT] Total Host Persistent Memory: 145120
[08/10/2022-09:58:20] [I] [TRT] Total Device Persistent Memory: 2082816
[08/10/2022-09:58:20] [I] [TRT] Total Scratch Memory: 0
[08/10/2022-09:58:20] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 34 MiB, GPU 213 MiB
[08/10/2022-09:58:20] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 87.029ms to assign 11 blocks to 181 nodes requiring 24294912 bytes.
[08/10/2022-09:58:20] [I] [TRT] Total Activation Memory: 24294912
[08/10/2022-09:58:20] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +7, GPU +9, now: CPU 7, GPU 9 (MiB)
[08/10/2022-09:58:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[08/10/2022-09:58:21] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 616, GPU 1414 (MiB)
[08/10/2022-09:58:21] [I] [TRT] Loaded engine size: 8 MiB
[08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +9, now: CPU 0, GPU 9 (MiB)
[INFO] csrcs/fastdeploy/backends/tensorrt/trt_backend.cc(496)::CreateTrtEngine TensorRT Engine is built succussfully.
[08/10/2022-09:58:21] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +25, now: CPU 0, GPU 34 (MiB)
loading annotations into memory...
Done (t=0.62s)
creating index...
index created!
2022-08-10 09:58:21 [INFO] Starting to read file list from dataset...
2022-08-10 09:58:22 [INFO] ...

The log that follows records the mAP, which is much lower than the result from the ONNX Runtime backend.
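For anyone reproducing the engine build outside of FastDeploy, here is a minimal sketch using the TensorRT 8.4 Python API. Because the model carries explicit Q/DQ nodes, setting the INT8 builder flag should be sufficient and no calibrator is needed; the file names are assumptions:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# The quantized ONNX is an explicit-batch, explicitly quantized (Q/DQ) network.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov5s_quant.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the Q/DQ scales baked into the model

engine_bytes = builder.build_serialized_network(network, config)
with open("yolov5s_quant.engine", "wb") as f:
    f.write(engine_bytes)
```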

zerollzeng commented 2 years ago

Just did a quick test with polygraphy:

[I] onnxrt-runner-N0-08/11/22-13:12:28
    ---- Inference Input(s) ----
    {x2paddle_images [dtype=float32, shape=(1, 3, 640, 640)]}
[I] onnxrt-runner-N0-08/11/22-13:12:28
    ---- Inference Output(s) ----
    {save_infer_model/scale_0.tmp_0 [dtype=float32, shape=(1, 25200, 85)]}
[I] onnxrt-runner-N0-08/11/22-13:12:28  | Completed 1 iteration(s) in 127.3 ms | Average inference time: 127.3 ms.
[I] Accuracy Comparison | trt-runner-N0-08/11/22-13:12:28 vs. onnxrt-runner-N0-08/11/22-13:12:28
[I]     Comparing Output: 'save_infer_model/scale_0.tmp_0' (dtype=float32, shape=(1, 25200, 85)) with 'save_infer_model/scale_0.tmp_0' (dtype=float32, shape=(1, 25200, 85))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/11/22-13:12:28: save_infer_model/scale_0.tmp_0 | Stats: mean=8.6759, std-dev=56.93, var=3241, median=0.00322, min=1.3113e-06 at (0, 972, 4), max=638.06 at (0, 2239, 0), avg-magnitude=8.6759
[I]             ---- Histogram ----
                Bin Range        |  Num Elems | Visualization
                (1.16e-06, 65.5) |    2088552 | ########################################
                (65.5    , 131 ) |      11194 |
                (131     , 196 ) |       6518 |
                (196     , 262 ) |       5630 |
                (262     , 327 ) |       5201 |
                (327     , 393 ) |       5169 |
                (393     , 458 ) |       5191 |
                (458     , 524 ) |       5386 |
                (524     , 589 ) |       5341 |
                (589     , 655 ) |       3818 |
[I]         onnxrt-runner-N0-08/11/22-13:12:28: save_infer_model/scale_0.tmp_0 | Stats: mean=8.6698, std-dev=56.936, var=3241.7, median=0.0031733, min=1.1623e-06 at (0, 972, 4), max=654.85 at (0, 24869, 2), avg-magnitude=8.6698
[I]             ---- Histogram ----
                Bin Range        |  Num Elems | Visualization
                (1.16e-06, 65.5) |    2088626 | ########################################
                (65.5    , 131 ) |      11133 |
                (131     , 196 ) |       6502 |
                (196     , 262 ) |       5614 |
                (262     , 327 ) |       5207 |
                (327     , 393 ) |       5178 |
                (393     , 458 ) |       5190 |
                (458     , 524 ) |       5388 |
                (524     , 589 ) |       5342 |
                (589     , 655 ) |       3820 |
[I]         Error Metrics: save_infer_model/scale_0.tmp_0
[I]             Minimum Required Tolerance: elemwise error | [abs=108.3] OR [rel=2.1755] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.061848, std-dev=0.67083, var=0.45002, median=0.00027022, min=0 at (0, 222, 4), max=108.3 at (0, 24771, 2), avg-magnitude=0.061848
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 10.8) |    2140748 | ########################################
                    (10.8, 21.7) |        905 |
                    (21.7, 32.5) |        227 |
                    (32.5, 43.3) |         59 |
                    (43.3, 54.2) |         39 |
                    (54.2, 65  ) |         15 |
                    (65  , 75.8) |          3 |
                    (75.8, 86.6) |          2 |
                    (86.6, 97.5) |          1 |
                    (97.5, 108 ) |          1 |
[I]             Relative Difference | Stats: mean=0.11164, std-dev=0.10477, var=0.010977, median=0.084045, min=0 at (0, 222, 4), max=2.1755 at (0, 21445, 23), avg-magnitude=0.11164
[I]                 ---- Histogram ----
                    Bin Range      |  Num Elems | Visualization
                    (0    , 0.218) |    1858320 | ########################################
                    (0.218, 0.435) |     252097 | #####
                    (0.435, 0.653) |      26393 |
                    (0.653, 0.87 ) |       4058 |
                    (0.87 , 1.09 ) |        834 |
                    (1.09 , 1.31 ) |        208 |
                    (1.31 , 1.52 ) |         64 |
                    (1.52 , 1.74 ) |         16 |
                    (1.74 , 1.96 ) |          4 |
                    (1.96 , 2.18 ) |          6 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['save_infer_model/scale_0.tmp_0']
[!] FAILED | Command: /usr/local/bin/polygraphy run yolov5s_quant.onnx --trt --int8 --onnxrt

@pranavm-nvidia @ttyio Do you have any suggestions here?
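Note that the tolerances above are polygraphy's strict defaults (abs=rel=1e-5), which any INT8 engine will exceed, so the FAILED verdict by itself doesn't prove the engine is broken; the 108.3 max absolute error on the raw outputs is the interesting part. A sketch of the same comparison through the polygraphy Python API with looser, purely illustrative tolerances:

```python
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

# Same two runners as `polygraphy run yolov5s_quant.onnx --trt --int8 --onnxrt`.
build_engine = EngineFromNetwork(NetworkFromOnnxPath("yolov5s_quant.onnx"),
                                 config=CreateConfig(int8=True))
runners = [OnnxrtRunner(SessionFromOnnx("yolov5s_quant.onnx")),
           TrtRunner(build_engine)]

results = Comparator.run(runners)
# Loosen the per-element tolerances; 1e-2 is illustrative, not a recommendation.
success = bool(Comparator.compare_accuracy(
    results, compare_func=CompareFunc.simple(atol=1e-2, rtol=1e-2)))
print("match within tolerance:", success)
```

Even then, element-wise tolerances on raw YOLO outputs are a blunt instrument; the task-level mAP comparison the reporter is running is the more meaningful check.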

ttyio commented 1 year ago

@yunyaoXYY, have you tried a different calibration algorithm or tuning the QAT parameters? There is also sample sensitivity-analysis code worth trying at https://github.com/NVIDIA/NeMo/blob/main/examples/asr/quantization/speech_to_text_quant_infer.py#L71

Thanks!
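The core of that sensitivity analysis, as a minimal sketch: assuming a pytorch-quantization QAT model, disable one TensorQuantizer at a time and re-measure accuracy to find the layers that hurt most. `evaluate_fn` is a hypothetical user-supplied callable returning mAP:

```python
from pytorch_quantization import nn as quant_nn

def sensitivity_scan(model, evaluate_fn):
    """Measure how much each quantizer costs by disabling it in isolation."""
    baseline = evaluate_fn(model)  # mAP with every quantizer enabled
    quantizers = [(name, mod) for name, mod in model.named_modules()
                  if isinstance(mod, quant_nn.TensorQuantizer)]
    for name, quantizer in quantizers:
        quantizer.disable()                   # run this tensor in FP32
        gain = evaluate_fn(model) - baseline  # recovered mAP, if any
        print(f"{name}: {gain:+.3f} mAP when left unquantized")
        quantizer.enable()                    # restore INT8 before the next probe
```

Layers that recover a lot of mAP when disabled are candidates for keeping in higher precision or for re-tuning their calibration.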

yunyaoXYY commented 1 year ago

> @yunyaoXYY, have you tried a different calibration algorithm or tuning the QAT parameters? There is also sample sensitivity-analysis code worth trying at https://github.com/NVIDIA/NeMo/blob/main/examples/asr/quantization/speech_to_text_quant_infer.py#L71
>
> Thanks!

Hi, this problem was solved a few months ago, thanks.