NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

onnx model convert trt.int8 failure:fallback fp32 #3754

Open kakascode opened 5 months ago

kakascode commented 5 months ago

Description

When I use TensorRT for INT8 quantization, the precision always falls back to FP32. The trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS flag does not solve the issue. What should I do?

Environment

TensorRT Version: 8.6.16

NVIDIA GPU: A100

CUDA Version: 11.4

Operating System:

Python Version (if applicable): 3.7

PyTorch Version (if applicable): 1.12.1

bernardrb commented 5 months ago

Can you provide what TRT logged during the build, and possibly the build script?

kakascode commented 5 months ago

Can you provide what TRT logged during the build, and possibly the build script?

Thanks, bro. I truncated the last part because it was too long.

[03/29/2024-17:37:32] [TRT] [V] Setting a default quantization params because quantization data is missing for {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]}
[03/29/2024-17:37:32] [TRT] [V] Tactic: 0x0000000000000000 Time: 59.9303
[03/29/2024-17:37:32] [TRT] [V] {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]} (Myelin[0x80000023]) profiling completed in 165.507 seconds. Fastest Tactic: 0x0000000000000000 Time: 59.9303
[03/29/2024-17:37:32] [TRT] [V] >>>>>>>>>>>>>>> Chose Runner Type: Myelin Tactic: 0x0000000000000000
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs
[03/29/2024-17:37:32] [TRT] [V] =============== Computing reformatting costs:
[03/29/2024-17:37:32] [TRT] [V] Autotuning Reformat: Half(49,1) -> Float(49,1)
[03/29/2024-17:37:32] [TRT] [V] --------------- Timing Runner: Optimizer Reformat( -> output) (Reformat[0x80000006])
[03/29/2024-17:37:32] [TRT] [V] Setting a default quantization params because quantization data is missing for
[03/29/2024-17:37:32] [TRT] [V] Tactic: 0x00000000000003e8 Time: 0.00406451
[03/29/2024-17:37:32] [TRT] [V] Setting a default quantization params because quantization data is missing for
[03/29/2024-17:37:32] [TRT] [V] Tactic: 0x00000000000003ea Time: 0.00652882
[03/29/2024-17:37:32] [TRT] [V] Setting a default quantization params because quantization data is missing for
[03/29/2024-17:37:33] [TRT] [V] Tactic: 0x0000000000000000 Time: 0.00409158
[03/29/2024-17:37:33] [TRT] [V] Optimizer Reformat( -> output) (Reformat[0x80000006]) profiling completed in 0.0299125 seconds. Fastest Tactic: 0x00000000000003e8 Time: 0.00406451
[03/29/2024-17:37:33] [TRT] [V] Adding reformat layer: Reformatted Output Tensor 0 to {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]} (output) from Half(49,1) to Float(49,1)
[03/29/2024-17:37:33] [TRT] [V] Formats and tactics selection completed in 302.827 seconds.
[03/29/2024-17:37:33] [TRT] [V] After reformat layers: 3 layers
[03/29/2024-17:37:33] [TRT] [V] Total number of blocks in pre-optimized block assignment: 3
[03/29/2024-17:37:33] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[03/29/2024-17:39:34] [TRT] [V] Setting a default quantization params because quantization data is missing for [ShapeHostToDeviceCopy 0]
[03/29/2024-17:39:34] [TRT] [V] Setting a default quantization params because quantization data is missing for {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]}
[03/29/2024-17:39:34] [TRT] [V] Layer: [ShapeHostToDeviceCopy 0] Host Persistent: 4 Device Persistent: 0 Scratch Memory: 0
[03/29/2024-17:39:34] [TRT] [V] Layer: {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]} Host Persistent: 32 Device Persistent: 0 Scratch Memory: 1115742720
[03/29/2024-17:39:34] [TRT] [V] Skipped printing memory information for 1 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
[03/29/2024-17:39:34] [TRT] [I] Total Host Persistent Memory: 48
[03/29/2024-17:39:34] [TRT] [I] Total Device Persistent Memory: 0
[03/29/2024-17:39:34] [TRT] [I] Total Scratch Memory: 1115742720
[03/29/2024-17:39:34] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2166 MiB, GPU 6049 MiB
[03/29/2024-17:39:34] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 3 steps to complete.
[03/29/2024-17:39:34] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.014939ms to assign 3 blocks to 3 nodes requiring 1115743744 bytes.
[03/29/2024-17:39:34] [TRT] [V] Total number of blocks in optimized block assignment: 3
[03/29/2024-17:39:34] [TRT] [I] Total Activation Memory: 1115743744
[03/29/2024-17:39:34] [TRT] [V] Total number of generated kernels selected for the engine: 0
[03/29/2024-17:39:34] [TRT] [V] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[03/29/2024-17:39:34] [TRT] [V] Disabling unused tactic source: JIT_CONVOLUTIONS
[03/29/2024-17:39:34] [TRT] [V] Engine generation completed in 425.016 seconds.
[03/29/2024-17:39:34] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[03/29/2024-17:39:34] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[03/29/2024-17:39:34] [TRT] [W] Check verbose logs for the list of affected weights.
[03/29/2024-17:39:34] [TRT] [W] - 256 weights are affected by this issue: Detected subnormal FP16 values.
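
One alternative when particular tensors never receive calibration scales is to set dynamic ranges on them explicitly; TensorRT uses an explicitly set range in preference to a calibrated one. A rough sketch against the TensorRT 8.x Python API (the +/-2.0 ranges are placeholders, not measured statistics from this model):

```python
import tensorrt as trt

# Assumes `network` is the parsed ONNX network from the build script later in
# this thread. Explicitly set dynamic ranges take precedence over calibrated
# scales; real values should come from activation statistics.
for i in range(network.num_inputs):
    network.get_input(i).set_dynamic_range(-2.0, 2.0)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    for j in range(layer.num_outputs):
        layer.get_output(j).set_dynamic_range(-2.0, 2.0)
```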

Also: when I set the FP16 flag before setting INT8, it falls back to FP16 instead.
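
One way to confirm which precision each layer actually ended up with is to build with detailed profiling verbosity and dump per-layer information from the engine inspector; a minimal sketch, assuming the engine has been serialized to a file (the path and logger below are placeholders, not taken from this thread):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)

# The builder config needs detailed verbosity for per-layer precisions to be
# reported, e.g. config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
with open("model_int8.engine", "rb") as f:          # placeholder path
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
# JSON output lists every layer together with the precision and tactic chosen.
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```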

bernardrb commented 5 months ago

[03/29/2024-17:37:32] [TRT] [V] Autotuning Reformat: Half(49,1) -> Float(49,1)

[03/29/2024-17:37:33] [TRT] [V] Adding reformat layer: Reformatted Output Tensor 0 to {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]} (output) from Half(49,1) to Float(49,1)

How many layers are affected? It could just be a necessary reformat layer that TensorRT adds at I/O. Refer to this for more info: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reformat-free-network-tensors
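
For context, that section covers pinning the type and format of network I/O tensors so the builder does not need to insert reformat layers such as the Half -> Float one above; a rough sketch with the TensorRT 8.x Python API (the output index and chosen format are illustrative, not taken from this model):

```python
import tensorrt as trt

# Assumes `network` and `config` are set up as in the build script below.
# Pin the output tensor's type/format so no boundary reformat is required.
out = network.get_output(0)
out.dtype = trt.float16
out.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)

# Optionally forbid reformats at network inputs/outputs altogether.
config.set_flag(trt.BuilderFlag.DIRECT_IO)
```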

Please share the whole log and the .onnx file in a Google Drive for further help.

Had the same issue with a Reformat layer #2136

kakascode commented 5 months ago

[03/29/2024-17:37:32] [TRT] [V] Autotuning Reformat: Half(49,1) -> Float(49,1)

[03/29/2024-17:37:33] [TRT] [V] Adding reformat layer: Reformatted Output Tensor 0 to {ForeignNode[onnx::Gather_401...(Unnamed Layer* 3201) [ElementWise]]} (output) from Half(49,1) to Float(49,1)

How many layers are affected? It could just be a necessary reformat layer that TensorRT adds at I/O. Refer to this for more info: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reformat-free-network-tensors

Please share the whole log and the .onnx file in a Google Drive for further help.

Had the same issue with a Reformat layer #2136

I am sorry for my late response. I will check your method, thanks for the help.

lix19937 commented 5 months ago

When I use TensorRT for INT8 quantization, the precision always falls back to FP32. The trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS flag does not solve the issue. What should I do?

What is your trtexec cmd?

kakascode commented 5 months ago

When I use TensorRT for INT8 quantization, the precision always falls back to FP32. The trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS flag does not solve the issue. What should I do?

What is your trtexec cmd?

I didn't use the trtexec command; instead, I used my own script.


```
import tensorrt as trt

# `logger` is a trt.Logger and `calib` is my INT8 calibrator; both are defined
# elsewhere in the script.
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser  = trt.OnnxParser(network, logger)
config  = builder.create_builder_config()
config.max_workspace_size = (1 << 30) * 8
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
config.int8_calibrator = calib
```

If I don't set `config.set_flag(trt.BuilderFlag.FP16)`, it falls back to FP32; otherwise, it falls back to FP16.
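
Note that trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS only enforces precisions that are explicitly set on individual layers; without per-layer constraints the builder is still free to choose FP32/FP16 wherever INT8 has no implementation or no scales. A rough sketch of pinning layers to INT8 before building (which layers can actually run in INT8 is model-dependent, so treat this as illustrative rather than a drop-in fix):

```python
import tensorrt as trt

# Assumes `network` is the parsed ONNX network from the script above.
# With OBEY_PRECISION_CONSTRAINTS set, the build fails instead of silently
# falling back if a pinned layer cannot run in INT8.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    layer.precision = trt.int8
    for j in range(layer.num_outputs):
        layer.set_output_type(j, trt.int8)
```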