NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

tensorrt8.2.3 inference time is 5ms, but 8.4.3 inference time is 80ms #2377

Closed ywfwyht closed 2 years ago

ywfwyht commented 2 years ago

Description

Hi, guys. After converting the ONNX model (linked below) to a TensorRT engine, the inference time is 5 ms with TRT 8.2.3 but 80 ms with TRT 8.4.3.

Environment

TensorRT Version: 8.4.3.1
NVIDIA GPU: 3080Ti
NVIDIA Driver Version: 515
CUDA Version: 11.6
CUDNN Version: 8.4
Operating System: ubuntu18.04
Python Version (if applicable): 3.8.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if so, version):

Relevant Files

https://github.com/ywfwyht/onnx_model/blob/main/0903_p28_t3_seg_simp.onnx

Steps To Reproduce

zerollzeng commented 2 years ago

I got a seg fault on TRT 8.2.3 (docker image 22.03):

I have no name!@4cff0ed21e85:/workspace$ trtexec --onnx=/zeroz/temp/2377/0903_p28_t3_seg_simp.onnx --verbose
[10/09/2022-08:43:12] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: -7067026478815706014
[10/09/2022-08:43:12] [V] [TRT] =============== Computing costs for
[10/09/2022-08:43:12] [V] [TRT] *************** Autotuning format combination: Float(1327104,20736,144,1) -> Float(165888,20736,144,1) ***************
[10/09/2022-08:43:12] [V] [TRT] --------------- Timing Runner: {ForeignNode[Reshape_111 + Transpose_112...Reshape_835]} (Myelin)
Segmentation fault (core dumped)

22.08 with TRT 8.4.2:

[10/09/2022-08:31:54] [I] === Performance summary ===
[10/09/2022-08:31:54] [I] Throughput: 68.1918 qps
[10/09/2022-08:31:54] [I] Latency: min = 15.8783 ms, max = 16.3865 ms, mean = 16.1759 ms, median = 16.1797 ms, percentile(99%) = 16.3816 ms
[10/09/2022-08:31:54] [I] Enqueue Time: min = 14.3593 ms, max = 14.8667 ms, mean = 14.6316 ms, median = 14.6232 ms, percentile(99%) = 14.864 ms
[10/09/2022-08:31:54] [I] H2D Latency: min = 1.43982 ms, max = 1.51929 ms, mean = 1.47669 ms, median = 1.47974 ms, percentile(99%) = 1.50665 ms
[10/09/2022-08:31:54] [I] GPU Compute Time: min = 14.3636 ms, max = 14.8584 ms, mean = 14.6431 ms, median = 14.6494 ms, percentile(99%) = 14.8408 ms
[10/09/2022-08:31:54] [I] D2H Latency: min = 0.0529785 ms, max = 0.0578613 ms, mean = 0.0560911 ms, median = 0.0560303 ms, percentile(99%) = 0.0577393 ms
[10/09/2022-08:31:54] [I] Total Host Walltime: 3.03556 s
[10/09/2022-08:31:54] [I] Total GPU Compute Time: 3.03113 s
&&&& PASSED TensorRT.trtexec [TensorRT v8402] # trtexec --onnx=0903_p28_t3_seg_simp.onnx --verbose

TRT 8.5 (22.09) perf is close to 22.08.

zerollzeng commented 2 years ago

@ywfwyht how did you run the model with TRT 8.2.3? Also, there seems to be a gap between your 8.4 result and mine.

ywfwyht commented 2 years ago

> @ywfwyht how did you run the model with TRT 8.2.3? Also, there seems to be a gap between your 8.4 result and mine.

If you use 8.2, you must turn on the --best option.

ywfwyht commented 2 years ago

Which of these numbers is the inference time?
My inference code is based on your sample https://github.com/NVIDIA/TensorRT/blob/main/samples/python/yolov3_onnx/onnx_to_tensorrt.py
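
For reference, a minimal sketch of how latency could be measured from Python in the style of that sample (pycuda-based; the engine file name "model.trt", warm-up count, and iteration count are placeholders, not from this thread). Note that it times the full H2D + execute + D2H round trip, so it corresponds to trtexec's "Latency" rather than "GPU Compute Time":

import time
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()

# One pinned host buffer and one device buffer per binding.
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(i)), dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

def run_once():
    for i in range(engine.num_bindings):
        if engine.binding_is_input(i):
            cuda.memcpy_htod_async(dev_bufs[i], host_bufs[i], stream)
    context.execute_async_v2(bindings, stream.handle)
    for i in range(engine.num_bindings):
        if not engine.binding_is_input(i):
            cuda.memcpy_dtoh_async(host_bufs[i], dev_bufs[i], stream)
    stream.synchronize()  # without this, you only time the enqueue call

for _ in range(10):  # warm-up: the first runs pay one-time init costs
    run_once()
samples = []
for _ in range(100):
    t0 = time.perf_counter()
    run_once()
    samples.append((time.perf_counter() - t0) * 1e3)
print(f"median latency: {np.median(samples):.3f} ms")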

ywfwyht commented 2 years ago

@zerollzeng Can you tell me about your environment? I still get an error with trt8.4.

ywfwyht commented 2 years ago

When I run the model with polygraphy it reports an error, but with trtexec it does not.

 polygraphy run submodel_backbone.onnx \
          --trt \
          --onnxrt \
          --pool-limit workspace:8G \
          --save-engine=submodel_backbone.trt \
          --atol 1e-3 --rtol 1e-3 \
          --verbose \
          --trt-outputs mark all \
          --onnx-outputs mark all \
          --fail-fast \
          --val-range [0,1]
[10/10/2022-02:31:06] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +750, GPU +318, now: CPU 1494, GPU 2208 (MiB)
[10/10/2022-02:31:06] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +127, GPU +60, now: CPU 1621, GPU 2268 (MiB)
[10/10/2022-02:31:06] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[10/10/2022-02:31:06] [TRT] [W] Skipping tactic 0x0000000000000000 due to Myelin error: Formal output tensor "1250 + (Unnamed Layer* 4) [Shuffle]_constant" is also a data tensor.
[10/10/2022-02:31:06] [TRT] [E] 10: [optimizer.cpp::computeCosts::3626] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[Reshape_111...Reshape_835]}.)
[10/10/2022-02:31:06] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[!] Invalid Engine. Please ensure the engine was built correctly

By contrast, trtexec builds the engine and profiles it successfully:

[10/10/2022-02:40:17] [I] === Profile (951 iterations ) ===
[10/10/2022-02:40:17] [I]                                                     Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[10/10/2022-02:40:17] [I]  {ForeignNode[Reshape_111 + Transpose_112...Reshape_835]}      754.56           0.7934             0.7926     86.6
[10/10/2022-02:40:17] [I]                                                  Conv_836      117.01           0.1230             0.1229     13.4
[10/10/2022-02:40:17] [I]                                                     Total      871.57           0.9165             0.9156    100.0
[10/10/2022-02:40:17] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8403] # /workspace/work_dir/TensorRT-8.4.3.1/bin/trtexec --onnx=submodel_backbone.onnx --saveEngine=/workspace/work_dir/K-Lane/submodel_backbone_trtexec.trt --workspace=12000 --useCudaGraph --dumpProfile

zerollzeng commented 2 years ago

> Which of these numbers is the inference time?

Median GPU compute time.

> Can you tell me about your environment? I still get an error with trt8.4.

I used the official docker images; maybe it's due to the CUDA version.

--best enables FP16 and INT8; without it, inference runs in FP32.

TRT 8.2:

[10/11/2022-16:29:05] [I] GPU Compute Time: min = 3.6731 ms, max = 3.76428 ms, mean = 3.72586 ms, median = 3.73242 ms, percentile(99%) = 3.7489 ms
trtexec --onnx=/zeroz/temp/2377/0903_p28_t3_seg_simp.onnx --verbose --best

TRT 8.4:

[10/11/2022-16:37:47] [I] GPU Compute Time: min = 3.47852 ms, max = 3.57684 ms, mean = 3.51626 ms, median = 3.52051 ms, percentile(99%) = 3.54614 ms
&&&& PASSED TensorRT.trtexec [TensorRT v8402] # trtexec --onnx=/zeroz/temp/2377/0903_p28_t3_seg_simp.onnx --verbose --best

Looks like no regression in TRT 8.4.

ywfwyht commented 2 years ago

I also suspect it's due to the CUDA version, but my environment (TensorRT 8.4.3.1, CUDA 11.6, cuDNN 8.4, NVIDIA driver 515) looks like it should be no problem.

ywfwyht commented 2 years ago

@zerollzeng How do I write the inference code? I should refer to this, right? https://github.com/NVIDIA/TensorRT/blob/main/samples/python/yolov3_onnx/onnx_to_tensorrt.py

zerollzeng commented 2 years ago

That should work. Please also refer to the Python API docs.
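
For example, a minimal build-and-save sketch with the TensorRT 8.x Python API, along the lines of that sample (file names are placeholders; the FP16 flag mirrors part of trtexec --best, and real INT8 use would additionally need a calibrator or explicit dynamic ranges):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model and surface any parser errors.
with open("0903_p28_t3_seg_simp.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)  # 8 GiB
config.set_flag(trt.BuilderFlag.FP16)
# config.set_flag(trt.BuilderFlag.INT8)  # --best also implies INT8, but that
#                                        # needs a calibrator or dynamic ranges

engine_bytes = builder.build_serialized_network(network, config)
if engine_bytes is None:
    raise SystemExit("engine build failed")
with open("model.trt", "wb") as f:
    f.write(engine_bytes)  # load this with the timing sketch shown earlier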