dmenig closed this issue 3 years ago.
On those models, I see more like a 28% regression in speed.
Hello @hyperfraise , could you provide the perf numbers again using trtexec with the options --noDataTransfers --dumpProfile --separateProfiling, and attach both logs here? Thanks!
I think you meant --separateProfileRun. I added those three options and here are the logs:
20.11:
root@6ec22ad980d8:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[04/30/2021-08:51:33] [I] === Model Options ===
[04/30/2021-08:51:33] [I] Format: ONNX
[04/30/2021-08:51:33] [I] Model: resnet.onnx
[04/30/2021-08:51:33] [I] Output:
[04/30/2021-08:51:33] [I] === Build Options ===
[04/30/2021-08:51:33] [I] Max batch: explicit
[04/30/2021-08:51:33] [I] Workspace: 5000 MiB
[04/30/2021-08:51:33] [I] minTiming: 1
[04/30/2021-08:51:33] [I] avgTiming: 8
[04/30/2021-08:51:33] [I] Precision: FP32+FP16+INT8
[04/30/2021-08:51:33] [I] Calibration: Dynamic
[04/30/2021-08:51:33] [I] Refit: Disabled
[04/30/2021-08:51:33] [I] Safe mode: Disabled
[04/30/2021-08:51:33] [I] Save engine: resnet.trt
[04/30/2021-08:51:33] [I] Load engine:
[04/30/2021-08:51:33] [I] Builder Cache: Enabled
[04/30/2021-08:51:33] [I] NVTX verbosity: 0
[04/30/2021-08:51:33] [I] Tactic sources: Using default tactic sources
[04/30/2021-08:51:33] [I] Input(s): fp32:chw
[04/30/2021-08:51:33] [I] Output(s): fp32:chw
[04/30/2021-08:51:33] [I] Input build shapes: model
[04/30/2021-08:51:33] [I] Input calibration shapes: model
[04/30/2021-08:51:33] [I] === System Options ===
[04/30/2021-08:51:33] [I] Device: 0
[04/30/2021-08:51:33] [I] DLACore:
[04/30/2021-08:51:33] [I] Plugins:
[04/30/2021-08:51:33] [I] === Inference Options ===
[04/30/2021-08:51:33] [I] Batch: Explicit
[04/30/2021-08:51:33] [I] Input inference shapes: model
[04/30/2021-08:51:33] [I] Iterations: 10
[04/30/2021-08:51:33] [I] Duration: 3s (+ 200ms warm up)
[04/30/2021-08:51:33] [I] Sleep time: 0ms
[04/30/2021-08:51:33] [I] Streams: 1
[04/30/2021-08:51:33] [I] ExposeDMA: Disabled
[04/30/2021-08:51:33] [I] Data transfers: Disabled
[04/30/2021-08:51:33] [I] Spin-wait: Disabled
[04/30/2021-08:51:33] [I] Multithreading: Disabled
[04/30/2021-08:51:33] [I] CUDA Graph: Disabled
[04/30/2021-08:51:33] [I] Separate profiling: Enabled
[04/30/2021-08:51:33] [I] Skip inference: Disabled
[04/30/2021-08:51:33] [I] Inputs:
[04/30/2021-08:51:33] [I] === Reporting Options ===
[04/30/2021-08:51:33] [I] Verbose: Disabled
[04/30/2021-08:51:33] [I] Averages: 10 inferences
[04/30/2021-08:51:33] [I] Percentile: 99
[04/30/2021-08:51:33] [I] Dump refittable layers:Disabled
[04/30/2021-08:51:33] [I] Dump output: Disabled
[04/30/2021-08:51:33] [I] Profile: Enabled
[04/30/2021-08:51:33] [I] Export timing to JSON file:
[04/30/2021-08:51:33] [I] Export output to JSON file:
[04/30/2021-08:51:33] [I] Export profile to JSON file:
[04/30/2021-08:51:33] [I]
[04/30/2021-08:51:33] [I] === Device Information ===
[04/30/2021-08:51:33] [I] Selected Device: GeForce GTX 1080 Ti
[04/30/2021-08:51:33] [I] Compute Capability: 6.1
[04/30/2021-08:51:33] [I] SMs: 28
[04/30/2021-08:51:33] [I] Compute Clock Rate: 1.6325 GHz
[04/30/2021-08:51:33] [I] Device Global Memory: 11178 MiB
[04/30/2021-08:51:33] [I] Shared Memory per SM: 96 KiB
[04/30/2021-08:51:33] [I] Memory Bus Width: 352 bits (ECC disabled)
[04/30/2021-08:51:33] [I] Memory Clock Rate: 5.505 GHz
[04/30/2021-08:51:33] [I]
----------------------------------------------------------------
Input filename: resnet.onnx
ONNX IR version: 0.0.6
Opset version: 9
Producer name: pytorch
Producer version: 1.8
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[04/30/2021-08:51:46] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[04/30/2021-08:51:46] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[04/30/2021-08:54:01] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/30/2021-08:54:11] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[04/30/2021-08:54:12] [I] Engine built in 158.76 sec.
[04/30/2021-08:54:12] [I] Starting inference
[04/30/2021-08:54:18] [I] Warmup completed 0 queries over 200 ms
[04/30/2021-08:54:18] [I] Timing trace has 0 queries over 5.67147 s
[04/30/2021-08:54:18] [I] Trace averages of 10 runs:
[04/30/2021-08:54:18] [I] Average on 10 runs - GPU latency: 567.144 ms - Host latency: 567.144 ms (end to end 567.144 ms, enqueue 5.0075 ms)
[04/30/2021-08:54:18] [I] Host Latency
[04/30/2021-08:54:18] [I] min: 564.432 ms (end to end 564.432 ms)
[04/30/2021-08:54:18] [I] max: 569.332 ms (end to end 569.332 ms)
[04/30/2021-08:54:18] [I] mean: 567.144 ms (end to end 567.144 ms)
[04/30/2021-08:54:18] [I] median: 567.073 ms (end to end 567.073 ms)
[04/30/2021-08:54:18] [I] percentile: 569.332 ms at 99% (end to end 569.332 ms at 99%)
[04/30/2021-08:54:18] [I] throughput: 0 qps
[04/30/2021-08:54:18] [I] walltime: 5.67147 s
[04/30/2021-08:54:18] [I] Enqueue Time
[04/30/2021-08:54:18] [I] min: 1.9541 ms
[04/30/2021-08:54:18] [I] max: 8.28448 ms
[04/30/2021-08:54:18] [I] median: 4.74475 ms
[04/30/2021-08:54:18] [I] GPU Compute
[04/30/2021-08:54:18] [I] min: 564.432 ms
[04/30/2021-08:54:18] [I] max: 569.332 ms
[04/30/2021-08:54:18] [I] mean: 567.144 ms
[04/30/2021-08:54:18] [I] median: 567.073 ms
[04/30/2021-08:54:18] [I] percentile: 569.332 ms at 99%
[04/30/2021-08:54:18] [I] total compute time: 5.67144 s
[04/30/2021-08:54:25] [I]
[04/30/2021-08:54:25] [I] === Profile (11 iterations ) ===
[04/30/2021-08:54:25] [I] Layer Time (ms) Avg. Time (ms) Time %
[04/30/2021-08:54:25] [I] Conv_0 + Relu_1 input reformatter 0 6.97 0.6335 0.1
[04/30/2021-08:54:25] [I] Conv_0 + Relu_1 173.17 15.7430 2.8
[04/30/2021-08:54:25] [I] Conv_2 + Relu_3 169.07 15.3696 2.7
[04/30/2021-08:54:25] [I] Conv_4 + Relu_5 623.77 56.7059 9.9
[04/30/2021-08:54:25] [I] Conv_6 + Relu_7 272.34 24.7580 4.3
[04/30/2021-08:54:25] [I] Conv_8 + Relu_9 633.04 57.5493 10.1
[04/30/2021-08:54:25] [I] Conv_10 + Add_11 + Relu_12 295.63 26.8753 4.7
[04/30/2021-08:54:25] [I] Conv_13 + Relu_14 631.06 57.3693 10.1
[04/30/2021-08:54:25] [I] Conv_15 + Relu_16 271.87 24.7155 4.3
[04/30/2021-08:54:25] [I] Conv_17 + Relu_18 627.31 57.0278 10.0
[04/30/2021-08:54:25] [I] Conv_19 + Add_20 + Relu_21 294.50 26.7729 4.7
[04/30/2021-08:54:25] [I] Conv_22 + Relu_23 259.98 23.6343 4.1
[04/30/2021-08:54:25] [I] Conv_24 + Relu_25 68.20 6.2004 1.1
[04/30/2021-08:54:25] [I] Conv_26 + Relu_27 input reformatter 0 7.67 0.6973 0.1
[04/30/2021-08:54:25] [I] Conv_26 + Relu_27 199.37 18.1245 3.2
[04/30/2021-08:54:25] [I] Conv_28 input reformatter 0 15.76 1.4331 0.3
[04/30/2021-08:54:25] [I] Conv_28 63.83 5.8026 1.0
[04/30/2021-08:54:25] [I] Conv_29 + Add_30 + Relu_31 30.52 2.7749 0.5
[04/30/2021-08:54:25] [I] Conv_32 + Relu_33 253.58 23.0530 4.0
[04/30/2021-08:54:25] [I] Conv_34 + Relu_35 78.28 7.1166 1.2
[04/30/2021-08:54:25] [I] Conv_36 + Relu_37 253.48 23.0439 4.0
[04/30/2021-08:54:25] [I] Conv_38 + Add_39 + Relu_40 83.95 7.6318 1.3
[04/30/2021-08:54:25] [I] Conv_41 + Relu_42 146.97 13.3608 2.3
[04/30/2021-08:54:25] [I] Conv_43 + Relu_44 28.73 2.6118 0.5
[04/30/2021-08:54:25] [I] Conv_45 + Relu_46 input reformatter 0 1.96 0.1781 0.0
[04/30/2021-08:54:25] [I] Conv_45 + Relu_46 89.20 8.1094 1.4
[04/30/2021-08:54:25] [I] Conv_47 input reformatter 0 3.83 0.3485 0.1
[04/30/2021-08:54:25] [I] Conv_47 30.85 2.8043 0.5
[04/30/2021-08:54:25] [I] Conv_48 + Add_49 + Relu_50 9.89 0.8994 0.2
[04/30/2021-08:54:25] [I] Conv_51 + Relu_52 input reformatter 0 1.97 0.1787 0.0
[04/30/2021-08:54:25] [I] Conv_51 + Relu_52 101.72 9.2470 1.6
[04/30/2021-08:54:25] [I] Conv_53 + Relu_54 41.94 3.8130 0.7
[04/30/2021-08:54:25] [I] Conv_55 + Relu_56 101.44 9.2214 1.6
[04/30/2021-08:54:25] [I] Conv_57 + Add_58 + Relu_59 input reformatter 0 4.80 0.4360 0.1
[04/30/2021-08:54:25] [I] Conv_57 + Add_58 + Relu_59 39.64 3.6038 0.6
[04/30/2021-08:54:25] [I] Conv_60 + Relu_61 input reformatter 0 1.95 0.1777 0.0
[04/30/2021-08:54:25] [I] Conv_60 + Relu_61 122.45 11.1317 2.0
[04/30/2021-08:54:25] [I] Conv_62 + Relu_63 17.26 1.5688 0.3
[04/30/2021-08:54:25] [I] Conv_64 + Relu_65 input reformatter 0 0.65 0.0587 0.0
[04/30/2021-08:54:25] [I] Conv_64 + Relu_65 49.33 4.4849 0.8
[04/30/2021-08:54:25] [I] Conv_66 15.96 1.4506 0.3
[04/30/2021-08:54:25] [I] Conv_67 + Add_68 + Relu_69 3.93 0.3571 0.1
[04/30/2021-08:54:25] [I] Conv_70 + Relu_71 52.28 4.7532 0.8
[04/30/2021-08:54:25] [I] Conv_72 + Relu_73 19.74 1.7943 0.3
[04/30/2021-08:54:25] [I] Conv_74 + Relu_75 52.42 4.7657 0.8
[04/30/2021-08:54:25] [I] Conv_76 + Add_77 + Relu_78 20.24 1.8400 0.3
[04/30/2021-08:54:25] [I] GlobalAveragePool_79 1.13 0.1028 0.0
[04/30/2021-08:54:25] [I] Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0 0.09 0.0079 0.0
[04/30/2021-08:54:25] [I] Gemm_81 0.18 0.0163 0.0
[04/30/2021-08:54:25] [I] Total 6273.90 570.3544 100.0
[04/30/2021-08:54:25] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
20.12:
root@10fb4bdae972:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[04/30/2021-08:36:35] [I] === Model Options ===
[04/30/2021-08:36:35] [I] Format: ONNX
[04/30/2021-08:36:35] [I] Model: resnet.onnx
[04/30/2021-08:36:35] [I] Output:
[04/30/2021-08:36:35] [I] === Build Options ===
[04/30/2021-08:36:35] [I] Max batch: explicit
[04/30/2021-08:36:35] [I] Workspace: 5000 MiB
[04/30/2021-08:36:35] [I] minTiming: 1
[04/30/2021-08:36:35] [I] avgTiming: 8
[04/30/2021-08:36:35] [I] Precision: FP32+FP16+INT8
[04/30/2021-08:36:35] [I] Calibration: Dynamic
[04/30/2021-08:36:35] [I] Refit: Disabled
[04/30/2021-08:36:35] [I] Safe mode: Disabled
[04/30/2021-08:36:35] [I] Save engine: resnet.trt
[04/30/2021-08:36:35] [I] Load engine:
[04/30/2021-08:36:35] [I] Builder Cache: Enabled
[04/30/2021-08:36:35] [I] NVTX verbosity: 0
[04/30/2021-08:36:35] [I] Tactic sources: Using default tactic sources
[04/30/2021-08:36:35] [I] Input(s): fp32:chw
[04/30/2021-08:36:35] [I] Output(s): fp32:chw
[04/30/2021-08:36:35] [I] Input build shapes: model
[04/30/2021-08:36:35] [I] Input calibration shapes: model
[04/30/2021-08:36:35] [I] === System Options ===
[04/30/2021-08:36:35] [I] Device: 0
[04/30/2021-08:36:35] [I] DLACore:
[04/30/2021-08:36:35] [I] Plugins:
[04/30/2021-08:36:35] [I] === Inference Options ===
[04/30/2021-08:36:35] [I] Batch: Explicit
[04/30/2021-08:36:35] [I] Input inference shapes: model
[04/30/2021-08:36:35] [I] Iterations: 10
[04/30/2021-08:36:35] [I] Duration: 3s (+ 200ms warm up)
[04/30/2021-08:36:35] [I] Sleep time: 0ms
[04/30/2021-08:36:35] [I] Streams: 1
[04/30/2021-08:36:35] [I] ExposeDMA: Disabled
[04/30/2021-08:36:35] [I] Data transfers: Disabled
[04/30/2021-08:36:35] [I] Spin-wait: Disabled
[04/30/2021-08:36:35] [I] Multithreading: Disabled
[04/30/2021-08:36:35] [I] CUDA Graph: Disabled
[04/30/2021-08:36:35] [I] Separate profiling: Enabled
[04/30/2021-08:36:35] [I] Skip inference: Disabled
[04/30/2021-08:36:35] [I] Inputs:
[04/30/2021-08:36:35] [I] === Reporting Options ===
[04/30/2021-08:36:35] [I] Verbose: Disabled
[04/30/2021-08:36:35] [I] Averages: 10 inferences
[04/30/2021-08:36:35] [I] Percentile: 99
[04/30/2021-08:36:35] [I] Dump refittable layers:Disabled
[04/30/2021-08:36:35] [I] Dump output: Disabled
[04/30/2021-08:36:35] [I] Profile: Enabled
[04/30/2021-08:36:35] [I] Export timing to JSON file:
[04/30/2021-08:36:35] [I] Export output to JSON file:
[04/30/2021-08:36:35] [I] Export profile to JSON file:
[04/30/2021-08:36:35] [I]
[04/30/2021-08:36:35] [I] === Device Information ===
[04/30/2021-08:36:35] [I] Selected Device: GeForce GTX 1080 Ti
[04/30/2021-08:36:35] [I] Compute Capability: 6.1
[04/30/2021-08:36:35] [I] SMs: 28
[04/30/2021-08:36:35] [I] Compute Clock Rate: 1.6325 GHz
[04/30/2021-08:36:35] [I] Device Global Memory: 11178 MiB
[04/30/2021-08:36:35] [I] Shared Memory per SM: 96 KiB
[04/30/2021-08:36:35] [I] Memory Bus Width: 352 bits (ECC disabled)
[04/30/2021-08:36:35] [I] Memory Clock Rate: 5.505 GHz
[04/30/2021-08:36:35] [I]
----------------------------------------------------------------
Input filename: resnet.onnx
ONNX IR version: 0.0.6
Opset version: 9
Producer name: pytorch
Producer version: 1.8
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[04/30/2021-08:36:49] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[04/30/2021-08:36:49] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[04/30/2021-08:39:06] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/30/2021-08:39:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[04/30/2021-08:39:17] [I] Engine built in 161.698 sec.
[04/30/2021-08:39:17] [I] Starting inference
[04/30/2021-08:39:24] [I] Warmup completed 0 queries over 200 ms
[04/30/2021-08:39:24] [I] Timing trace has 0 queries over 6.02128 s
[04/30/2021-08:39:24] [I] Trace averages of 10 runs:
[04/30/2021-08:39:24] [I] Average on 10 runs - GPU latency: 602.126 ms - Host latency: 602.126 ms (end to end 602.126 ms, enqueue 3.9255 ms)
[04/30/2021-08:39:24] [I] Host Latency
[04/30/2021-08:39:24] [I] min: 599.799 ms (end to end 599.799 ms)
[04/30/2021-08:39:24] [I] max: 603.952 ms (end to end 603.952 ms)
[04/30/2021-08:39:24] [I] mean: 602.126 ms (end to end 602.126 ms)
[04/30/2021-08:39:24] [I] median: 602.148 ms (end to end 602.148 ms)
[04/30/2021-08:39:24] [I] percentile: 603.952 ms at 99% (end to end 603.952 ms at 99%)
[04/30/2021-08:39:24] [I] throughput: 0 qps
[04/30/2021-08:39:24] [I] walltime: 6.02128 s
[04/30/2021-08:39:24] [I] Enqueue Time
[04/30/2021-08:39:24] [I] min: 3.33887 ms
[04/30/2021-08:39:24] [I] max: 4.47217 ms
[04/30/2021-08:39:24] [I] median: 3.82956 ms
[04/30/2021-08:39:24] [I] GPU Compute
[04/30/2021-08:39:24] [I] min: 599.799 ms
[04/30/2021-08:39:24] [I] max: 603.952 ms
[04/30/2021-08:39:24] [I] mean: 602.126 ms
[04/30/2021-08:39:24] [I] median: 602.148 ms
[04/30/2021-08:39:24] [I] percentile: 603.952 ms at 99%
[04/30/2021-08:39:24] [I] total compute time: 6.02126 s
[04/30/2021-08:39:30] [I]
[04/30/2021-08:39:30] [I] === Profile (11 iterations ) ===
[04/30/2021-08:39:30] [I] Layer Time (ms) Avg. Time (ms) Time %
[04/30/2021-08:39:30] [I] Conv_0 + Relu_1 input reformatter 0 6.80 0.6186 0.1
[04/30/2021-08:39:30] [I] Conv_0 + Relu_1 161.23 14.6573 2.4
[04/30/2021-08:39:30] [I] Conv_2 + Relu_3 140.68 12.7886 2.1
[04/30/2021-08:39:30] [I] Conv_4 + Relu_5 593.23 53.9304 8.9
[04/30/2021-08:39:30] [I] Conv_6 + Relu_7 221.12 20.1014 3.3
[04/30/2021-08:39:30] [I] Conv_8 + Relu_9 616.47 56.0428 9.3
[04/30/2021-08:39:30] [I] Conv_10 + Add_11 + Relu_12 245.15 22.2864 3.7
[04/30/2021-08:39:30] [I] Conv_13 + Relu_14 611.87 55.6248 9.2
[04/30/2021-08:39:30] [I] Conv_15 + Relu_16 220.98 20.0887 3.3
[04/30/2021-08:39:30] [I] Conv_17 + Relu_18 608.63 55.3298 9.2
[04/30/2021-08:39:30] [I] Conv_19 + Add_20 + Relu_21 244.15 22.1955 3.7
[04/30/2021-08:39:30] [I] Conv_19 + Add_20 + Relu_21 output reformatter 0 28.64 2.6038 0.4
[04/30/2021-08:39:30] [I] Conv_22 + Relu_23 308.68 28.0622 4.7
[04/30/2021-08:39:30] [I] Conv_24 + Relu_25 77.74 7.0669 1.2
[04/30/2021-08:39:30] [I] Conv_26 + Relu_27 280.21 25.4732 4.2
[04/30/2021-08:39:30] [I] Conv_28 70.68 6.4256 1.1
[04/30/2021-08:39:30] [I] Conv_29 + Add_30 + Relu_31 38.21 3.4738 0.6
[04/30/2021-08:39:30] [I] Conv_32 + Relu_33 409.39 37.2171 6.2
[04/30/2021-08:39:30] [I] Conv_34 + Relu_35 88.38 8.0343 1.3
[04/30/2021-08:39:30] [I] Conv_36 + Relu_37 408.98 37.1797 6.2
[04/30/2021-08:39:30] [I] Conv_38 + Add_39 + Relu_40 98.79 8.9807 1.5
[04/30/2021-08:39:30] [I] Conv_38 + Add_39 + Relu_40 output reformatter 0 8.49 0.7723 0.1
[04/30/2021-08:39:30] [I] Conv_41 + Relu_42 140.08 12.7344 2.1
[04/30/2021-08:39:30] [I] Conv_43 + Relu_44 27.54 2.5037 0.4
[04/30/2021-08:39:30] [I] Conv_45 + Relu_46 120.52 10.9566 1.8
[04/30/2021-08:39:30] [I] Conv_47 37.47 3.4063 0.6
[04/30/2021-08:39:30] [I] Conv_48 + Add_49 + Relu_50 9.55 0.8679 0.1
[04/30/2021-08:39:30] [I] Conv_51 + Relu_52 175.88 15.9887 2.7
[04/30/2021-08:39:30] [I] Conv_53 + Relu_54 46.05 4.1864 0.7
[04/30/2021-08:39:30] [I] Conv_55 + Relu_56 175.45 15.9497 2.6
[04/30/2021-08:39:30] [I] Conv_57 + Add_58 + Relu_59 47.57 4.3244 0.7
[04/30/2021-08:39:30] [I] Conv_60 + Relu_61 input reformatter 0 1.87 0.1696 0.0
[04/30/2021-08:39:30] [I] Conv_60 + Relu_61 43.41 3.9463 0.7
[04/30/2021-08:39:30] [I] Conv_62 + Relu_63 input reformatter 0 1.90 0.1725 0.0
[04/30/2021-08:39:30] [I] Conv_62 + Relu_63 18.08 1.6440 0.3
[04/30/2021-08:39:30] [I] Conv_64 + Relu_65 77.87 7.0794 1.2
[04/30/2021-08:39:30] [I] Conv_66 17.61 1.6013 0.3
[04/30/2021-08:39:30] [I] Conv_67 + Add_68 + Relu_69 3.73 0.3395 0.1
[04/30/2021-08:39:30] [I] Conv_70 + Relu_71 79.39 7.2176 1.2
[04/30/2021-08:39:30] [I] Conv_72 + Relu_73 21.48 1.9528 0.3
[04/30/2021-08:39:30] [I] Conv_74 + Relu_75 78.93 7.1758 1.2
[04/30/2021-08:39:30] [I] Conv_76 + Add_77 + Relu_78 21.82 1.9834 0.3
[04/30/2021-08:39:30] [I] GlobalAveragePool_79 1.09 0.0989 0.0
[04/30/2021-08:39:30] [I] Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0 0.08 0.0069 0.0
[04/30/2021-08:39:30] [I] Gemm_81 0.17 0.0153 0.0
[04/30/2021-08:39:30] [I] Total 6636.03 603.2751 100.0
[04/30/2021-08:39:30] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
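To compare the two per-layer profiles above less by eye, both runs could be repeated with trtexec's --exportProfile=&lt;file&gt;.json option (the "Export profile to JSON file" field shown in the logs) and diffed. A minimal sketch, assuming the exported file is a JSON array whose layer entries carry "name" and "averageMs" fields (the exact schema may differ across TensorRT versions, so treat the key names as an assumption):

```python
import json

def load_profile(path):
    """Load a trtexec --exportProfile JSON dump into {layer_name: avg_ms}.

    Assumes a JSON array of objects with "name" and "averageMs" keys;
    any leading entry without a "name" (e.g. {"count": N}) is skipped.
    """
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries if "name" in e}

def diff_profiles(old, new, top=10):
    """Return the layers with the largest absolute slowdown (new - old), in ms."""
    deltas = {name: new[name] - old[name] for name in old if name in new}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Sorting by absolute slowdown would immediately surface layers like Conv_32 + Relu_33, which goes from about 23 ms to about 37 ms average between the 20.11 and 20.12 tables above.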
Hello @hyperfraise , comparing the GPU compute medians,
[04/30/2021-08:54:18] [I] median: 567.073 ms
with
[04/30/2021-08:39:24] [I] median: 602.148 ms
gives (602.148 - 567.073) / 567.073 = 6.1%; I did not see a 28% regression. Could you double-check? Thanks
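The arithmetic can be checked in a couple of lines (medians taken from the GPU Compute sections of the two logs; the small difference from 6.1% is just rounding):

```python
def regression_pct(old_ms, new_ms):
    """Percent slowdown of new relative to old, from median latencies."""
    return (new_ms - old_ms) / old_ms * 100

# 20.11 vs 20.12 GPU compute medians from the logs above
print(f"20.11 -> 20.12: {regression_pct(567.073, 602.148):.1f}%")  # ~6.2%
```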
I was talking about another model regarding the 28% regression. For this model I observed 7.5 spl/s compared to 6.8, which is roughly a 10% regression. I don't think the stats from the dump are perfectly representative of the real mean inference speed. 6% is still a regression anyway (enough to make me think the problem is the same as the one I observe with the other model, which has a 28% regression!)
So @ttyio do you think you guys might be able to do something to solve this ?
Hello @hyperfraise , I can repro the 6% regression on a Pascal device, but given our limited development bandwidth, sorry, it is not at the top of the priority queue. Could you try the latest 8.0 EA release? Thanks
Sure, I'll just wait for the nvcr release of TensorRT 8.0 in a new docker image. I hope you guys don't feel overwhelmed and manage to code stress free.
Oh, FYI: on 21.05 (which is TensorRT 7.2.3-1), the results are even worse :/
root@5e3e8fe1e488:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[05/21/2021-09:56:36] [I] === Model Options ===
[05/21/2021-09:56:36] [I] Format: ONNX
[05/21/2021-09:56:36] [I] Model: resnet.onnx
[05/21/2021-09:56:36] [I] Output:
[05/21/2021-09:56:36] [I] === Build Options ===
[05/21/2021-09:56:36] [I] Max batch: explicit
[05/21/2021-09:56:36] [I] Workspace: 5000 MiB
[05/21/2021-09:56:36] [I] minTiming: 1
[05/21/2021-09:56:36] [I] avgTiming: 8
[05/21/2021-09:56:36] [I] Precision: FP32+FP16+INT8
[05/21/2021-09:56:36] [I] Calibration: Dynamic
[05/21/2021-09:56:36] [I] Refit: Disabled
[05/21/2021-09:56:36] [I] Safe mode: Disabled
[05/21/2021-09:56:36] [I] Save engine: resnet.trt
[05/21/2021-09:56:36] [I] Load engine:
[05/21/2021-09:56:36] [I] Builder Cache: Enabled
[05/21/2021-09:56:36] [I] NVTX verbosity: 0
[05/21/2021-09:56:36] [I] Tactic sources: Using default tactic sources
[05/21/2021-09:56:36] [I] Input(s): fp32:chw
[05/21/2021-09:56:36] [I] Output(s): fp32:chw
[05/21/2021-09:56:36] [I] Input build shapes: model
[05/21/2021-09:56:36] [I] Input calibration shapes: model
[05/21/2021-09:56:36] [I] === System Options ===
[05/21/2021-09:56:36] [I] Device: 0
[05/21/2021-09:56:36] [I] DLACore:
[05/21/2021-09:56:36] [I] Plugins:
[05/21/2021-09:56:36] [I] === Inference Options ===
[05/21/2021-09:56:36] [I] Batch: Explicit
[05/21/2021-09:56:36] [I] Input inference shapes: model
[05/21/2021-09:56:36] [I] Iterations: 10
[05/21/2021-09:56:36] [I] Duration: 3s (+ 200ms warm up)
[05/21/2021-09:56:36] [I] Sleep time: 0ms
[05/21/2021-09:56:36] [I] Streams: 1
[05/21/2021-09:56:36] [I] ExposeDMA: Disabled
[05/21/2021-09:56:36] [I] Data transfers: Disabled
[05/21/2021-09:56:36] [I] Spin-wait: Disabled
[05/21/2021-09:56:36] [I] Multithreading: Disabled
[05/21/2021-09:56:36] [I] CUDA Graph: Disabled
[05/21/2021-09:56:36] [I] Separate profiling: Enabled
[05/21/2021-09:56:36] [I] Skip inference: Disabled
[05/21/2021-09:56:36] [I] Inputs:
[05/21/2021-09:56:36] [I] === Reporting Options ===
[05/21/2021-09:56:36] [I] Verbose: Disabled
[05/21/2021-09:56:36] [I] Averages: 10 inferences
[05/21/2021-09:56:36] [I] Percentile: 99
[05/21/2021-09:56:36] [I] Dump refittable layers:Disabled
[05/21/2021-09:56:36] [I] Dump output: Disabled
[05/21/2021-09:56:36] [I] Profile: Enabled
[05/21/2021-09:56:36] [I] Export timing to JSON file:
[05/21/2021-09:56:36] [I] Export output to JSON file:
[05/21/2021-09:56:36] [I] Export profile to JSON file:
[05/21/2021-09:56:36] [I]
[05/21/2021-09:56:36] [I] === Device Information ===
[05/21/2021-09:56:36] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti
[05/21/2021-09:56:36] [I] Compute Capability: 6.1
[05/21/2021-09:56:36] [I] SMs: 28
[05/21/2021-09:56:36] [I] Compute Clock Rate: 1.6325 GHz
[05/21/2021-09:56:36] [I] Device Global Memory: 11178 MiB
[05/21/2021-09:56:36] [I] Shared Memory per SM: 96 KiB
[05/21/2021-09:56:36] [I] Memory Bus Width: 352 bits (ECC disabled)
[05/21/2021-09:56:36] [I] Memory Clock Rate: 5.505 GHz
[05/21/2021-09:56:36] [I]
[05/21/2021-09:56:49] [I] [TRT] ----------------------------------------------------------------
[05/21/2021-09:56:49] [I] [TRT] Input filename: resnet.onnx
[05/21/2021-09:56:49] [I] [TRT] ONNX IR version: 0.0.6
[05/21/2021-09:56:49] [I] [TRT] Opset version: 9
[05/21/2021-09:56:49] [I] [TRT] Producer name: pytorch
[05/21/2021-09:56:49] [I] [TRT] Producer version: 1.8
[05/21/2021-09:56:49] [I] [TRT] Domain:
[05/21/2021-09:56:49] [I] [TRT] Model version: 0
[05/21/2021-09:56:49] [I] [TRT] Doc string:
[05/21/2021-09:56:49] [I] [TRT] ----------------------------------------------------------------
[05/21/2021-09:56:49] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[05/21/2021-09:56:49] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[05/21/2021-09:59:09] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[05/21/2021-09:59:21] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/21/2021-09:59:22] [I] Engine built in 165.295 sec.
[05/21/2021-09:59:22] [I] Starting inference
[05/21/2021-09:59:29] [I] Warmup completed 0 queries over 200 ms
[05/21/2021-09:59:29] [I] Timing trace has 0 queries over 6.43171 s
[05/21/2021-09:59:29] [I] Trace averages of 10 runs:
[05/21/2021-09:59:29] [I] Average on 10 runs - GPU latency: 643.169 ms - Host latency: 643.169 ms (end to end 643.169 ms, enqueue 4.60832 ms)
[05/21/2021-09:59:29] [I] Host Latency
[05/21/2021-09:59:29] [I] min: 638.401 ms (end to end 638.401 ms)
[05/21/2021-09:59:29] [I] max: 646.992 ms (end to end 646.992 ms)
[05/21/2021-09:59:29] [I] mean: 643.169 ms (end to end 643.169 ms)
[05/21/2021-09:59:29] [I] median: 643.202 ms (end to end 643.202 ms)
[05/21/2021-09:59:29] [I] percentile: 646.992 ms at 99% (end to end 646.992 ms at 99%)
[05/21/2021-09:59:29] [I] throughput: 0 qps
[05/21/2021-09:59:29] [I] walltime: 6.43171 s
[05/21/2021-09:59:29] [I] Enqueue Time
[05/21/2021-09:59:29] [I] min: 4.18799 ms
[05/21/2021-09:59:29] [I] max: 5.26367 ms
[05/21/2021-09:59:29] [I] median: 4.59708 ms
[05/21/2021-09:59:29] [I] GPU Compute
[05/21/2021-09:59:29] [I] min: 638.401 ms
[05/21/2021-09:59:29] [I] max: 646.992 ms
[05/21/2021-09:59:29] [I] mean: 643.169 ms
[05/21/2021-09:59:29] [I] median: 643.202 ms
[05/21/2021-09:59:29] [I] percentile: 646.992 ms at 99%
[05/21/2021-09:59:29] [I] total compute time: 6.43169 s
[05/21/2021-09:59:36] [I]
[05/21/2021-09:59:36] [I] === Profile (11 iterations ) ===
[05/21/2021-09:59:36] [I] Layer Time (ms) Avg. Time (ms) Time %
[05/21/2021-09:59:36] [I] Conv_0 + Relu_1 input reformatter 0 6.78 0.6160 0.1
[05/21/2021-09:59:36] [I] Conv_0 + Relu_1 172.72 15.7019 2.4
[05/21/2021-09:59:36] [I] Conv_2 + Relu_3 146.27 13.2970 2.1
[05/21/2021-09:59:36] [I] Conv_4 + Relu_5 624.73 56.7937 8.8
[05/21/2021-09:59:36] [I] Conv_6 + Relu_7 230.88 20.9890 3.2
[05/21/2021-09:59:36] [I] Conv_8 + Relu_9 646.51 58.7733 9.1
[05/21/2021-09:59:36] [I] Conv_10 + Add_11 + Relu_12 254.60 23.1453 3.6
[05/21/2021-09:59:36] [I] Conv_13 + Relu_14 646.50 58.7731 9.1
[05/21/2021-09:59:36] [I] Conv_15 + Relu_16 231.37 21.0339 3.3
[05/21/2021-09:59:36] [I] Conv_17 + Relu_18 644.87 58.6249 9.1
[05/21/2021-09:59:36] [I] Conv_19 + Add_20 + Relu_21 254.10 23.0998 3.6
[05/21/2021-09:59:36] [I] Conv_19 + Add_20 + Relu_21 output reformatter 0 30.56 2.7778 0.4
[05/21/2021-09:59:36] [I] Conv_22 + Relu_23 347.64 31.6034 4.9
[05/21/2021-09:59:36] [I] Conv_24 + Relu_25 79.01 7.1826 1.1
[05/21/2021-09:59:36] [I] Conv_26 + Relu_27 315.81 28.7103 4.4
[05/21/2021-09:59:36] [I] Conv_28 71.34 6.4858 1.0
[05/21/2021-09:59:36] [I] Conv_29 + Add_30 + Relu_31 40.39 3.6723 0.6
[05/21/2021-09:59:36] [I] Conv_32 + Relu_33 461.13 41.9213 6.5
[05/21/2021-09:59:36] [I] Conv_34 + Relu_35 90.01 8.1824 1.3
[05/21/2021-09:59:36] [I] Conv_36 + Relu_37 457.58 41.5979 6.4
[05/21/2021-09:59:36] [I] Conv_38 + Add_39 + Relu_40 100.24 9.1126 1.4
[05/21/2021-09:59:36] [I] Conv_38 + Add_39 + Relu_40 output reformatter 0 8.65 0.7863 0.1
[05/21/2021-09:59:36] [I] Conv_41 + Relu_42 160.36 14.5785 2.3
[05/21/2021-09:59:36] [I] Conv_43 + Relu_44 28.89 2.6267 0.4
[05/21/2021-09:59:36] [I] Conv_45 + Relu_46 129.67 11.7878 1.8
[05/21/2021-09:59:36] [I] Conv_47 38.74 3.5214 0.5
[05/21/2021-09:59:36] [I] Conv_48 + Add_49 + Relu_50 9.78 0.8890 0.1
[05/21/2021-09:59:36] [I] Conv_51 + Relu_52 188.34 17.1221 2.6
[05/21/2021-09:59:36] [I] Conv_53 + Relu_54 47.58 4.3254 0.7
[05/21/2021-09:59:36] [I] Conv_55 + Relu_56 187.94 17.0853 2.6
[05/21/2021-09:59:36] [I] Conv_57 + Add_58 + Relu_59 49.01 4.4556 0.7
[05/21/2021-09:59:36] [I] Conv_60 + Relu_61 input reformatter 0 1.96 0.1779 0.0
[05/21/2021-09:59:36] [I] Conv_60 + Relu_61 45.18 4.1073 0.6
[05/21/2021-09:59:36] [I] Conv_62 + Relu_63 input reformatter 0 1.93 0.1752 0.0
[05/21/2021-09:59:36] [I] Conv_62 + Relu_63 18.87 1.7156 0.3
[05/21/2021-09:59:36] [I] Conv_64 + Relu_65 input reformatter 0 0.59 0.0536 0.0
[05/21/2021-09:59:36] [I] Conv_64 + Relu_65 79.72 7.2476 1.1
[05/21/2021-09:59:36] [I] Conv_66 input reformatter 0 1.10 0.1000 0.0
[05/21/2021-09:59:36] [I] Conv_66 18.49 1.6807 0.3
[05/21/2021-09:59:36] [I] Conv_67 + Add_68 + Relu_69 3.82 0.3477 0.1
[05/21/2021-09:59:36] [I] Conv_70 + Relu_71 input reformatter 0 0.59 0.0538 0.0
[05/21/2021-09:59:36] [I] Conv_70 + Relu_71 93.10 8.4637 1.3
[05/21/2021-09:59:36] [I] Conv_72 + Relu_73 input reformatter 0 1.36 0.1233 0.0
[05/21/2021-09:59:36] [I] Conv_72 + Relu_73 22.65 2.0595 0.3
[05/21/2021-09:59:36] [I] Conv_74 + Relu_75 input reformatter 0 0.59 0.0534 0.0
[05/21/2021-09:59:36] [I] Conv_74 + Relu_75 92.77 8.4334 1.3
[05/21/2021-09:59:36] [I] Conv_76 + Add_77 + Relu_78 input reformatter 0 1.36 0.1234 0.0
[05/21/2021-09:59:36] [I] Conv_76 + Add_77 + Relu_78 23.08 2.0984 0.3
[05/21/2021-09:59:36] [I] GlobalAveragePool_79 1.12 0.1021 0.0
[05/21/2021-09:59:36] [I] Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0 0.08 0.0073 0.0
[05/21/2021-09:59:36] [I] Gemm_81 0.16 0.0148 0.0
[05/21/2021-09:59:36] [I] Total 7110.52 646.4112 100.0
[05/21/2021-09:59:36] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
So, about a 13% regression.
Hello @hyperfraise , could you try again on 21.05 without --best? Thanks
Hi, sure! Here are my results on 21.05, without --best:
root@6b4242bad94e:/workspace# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/02/2021-14:59:02] [I] === Model Options ===
[06/02/2021-14:59:02] [I] Format: ONNX
[06/02/2021-14:59:02] [I] Model: resnet.onnx
[06/02/2021-14:59:02] [I] Output:
[06/02/2021-14:59:02] [I] === Build Options ===
[06/02/2021-14:59:02] [I] Max batch: explicit
[06/02/2021-14:59:02] [I] Workspace: 5000 MiB
[06/02/2021-14:59:02] [I] minTiming: 1
[06/02/2021-14:59:02] [I] avgTiming: 8
[06/02/2021-14:59:02] [I] Precision: FP32
[06/02/2021-14:59:02] [I] Calibration:
[06/02/2021-14:59:02] [I] Refit: Disabled
[06/02/2021-14:59:02] [I] Safe mode: Disabled
[06/02/2021-14:59:02] [I] Save engine: resnet.trt
[06/02/2021-14:59:02] [I] Load engine:
[06/02/2021-14:59:02] [I] Builder Cache: Enabled
[06/02/2021-14:59:02] [I] NVTX verbosity: 0
[06/02/2021-14:59:02] [I] Tactic sources: Using default tactic sources
[06/02/2021-14:59:02] [I] Input(s): fp32:chw
[06/02/2021-14:59:02] [I] Output(s): fp32:chw
[06/02/2021-14:59:02] [I] Input build shapes: model
[06/02/2021-14:59:02] [I] Input calibration shapes: model
[06/02/2021-14:59:02] [I] === System Options ===
[06/02/2021-14:59:02] [I] Device: 0
[06/02/2021-14:59:02] [I] DLACore:
[06/02/2021-14:59:02] [I] Plugins:
[06/02/2021-14:59:02] [I] === Inference Options ===
[06/02/2021-14:59:02] [I] Batch: Explicit
[06/02/2021-14:59:02] [I] Input inference shapes: model
[06/02/2021-14:59:02] [I] Iterations: 10
[06/02/2021-14:59:02] [I] Duration: 3s (+ 200ms warm up)
[06/02/2021-14:59:02] [I] Sleep time: 0ms
[06/02/2021-14:59:02] [I] Streams: 1
[06/02/2021-14:59:02] [I] ExposeDMA: Disabled
[06/02/2021-14:59:02] [I] Data transfers: Disabled
[06/02/2021-14:59:02] [I] Spin-wait: Disabled
[06/02/2021-14:59:02] [I] Multithreading: Disabled
[06/02/2021-14:59:02] [I] CUDA Graph: Disabled
[06/02/2021-14:59:02] [I] Separate profiling: Enabled
[06/02/2021-14:59:02] [I] Skip inference: Disabled
[06/02/2021-14:59:02] [I] Inputs:
[06/02/2021-14:59:02] [I] === Reporting Options ===
[06/02/2021-14:59:02] [I] Verbose: Disabled
[06/02/2021-14:59:02] [I] Averages: 10 inferences
[06/02/2021-14:59:02] [I] Percentile: 99
[06/02/2021-14:59:02] [I] Dump refittable layers:Disabled
[06/02/2021-14:59:02] [I] Dump output: Disabled
[06/02/2021-14:59:02] [I] Profile: Enabled
[06/02/2021-14:59:02] [I] Export timing to JSON file:
[06/02/2021-14:59:02] [I] Export output to JSON file:
[06/02/2021-14:59:02] [I] Export profile to JSON file:
[06/02/2021-14:59:02] [I]
[06/02/2021-14:59:02] [I] === Device Information ===
[06/02/2021-14:59:02] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti
[06/02/2021-14:59:02] [I] Compute Capability: 6.1
[06/02/2021-14:59:02] [I] SMs: 28
[06/02/2021-14:59:02] [I] Compute Clock Rate: 1.62 GHz
[06/02/2021-14:59:02] [I] Device Global Memory: 11177 MiB
[06/02/2021-14:59:02] [I] Shared Memory per SM: 96 KiB
[06/02/2021-14:59:02] [I] Memory Bus Width: 352 bits (ECC disabled)
[06/02/2021-14:59:02] [I] Memory Clock Rate: 5.505 GHz
[06/02/2021-14:59:02] [I]
[06/02/2021-14:59:12] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-14:59:12] [I] [TRT] Input filename: resnet.onnx
[06/02/2021-14:59:12] [I] [TRT] ONNX IR version: 0.0.6
[06/02/2021-14:59:12] [I] [TRT] Opset version: 9
[06/02/2021-14:59:12] [I] [TRT] Producer name: pytorch
[06/02/2021-14:59:12] [I] [TRT] Producer version: 1.8
[06/02/2021-14:59:12] [I] [TRT] Domain:
[06/02/2021-14:59:12] [I] [TRT] Model version: 0
[06/02/2021-14:59:12] [I] [TRT] Doc string:
[06/02/2021-14:59:12] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-15:00:22] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/02/2021-15:00:26] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[06/02/2021-15:00:27] [I] Engine built in 85.6795 sec.
[06/02/2021-15:00:27] [I] Starting inference
[06/02/2021-15:00:36] [I] Warmup completed 0 queries over 200 ms
[06/02/2021-15:00:36] [I] Timing trace has 0 queries over 7.97772 s
[06/02/2021-15:00:36] [I] Trace averages of 10 runs:
[06/02/2021-15:00:36] [I] Average on 10 runs - GPU latency: 797.77 ms - Host latency: 797.77 ms (end to end 797.77 ms, enqueue 3.45671 ms)
[06/02/2021-15:00:36] [I] Host Latency
[06/02/2021-15:00:36] [I] min: 768.763 ms (end to end 768.763 ms)
[06/02/2021-15:00:36] [I] max: 826.762 ms (end to end 826.762 ms)
[06/02/2021-15:00:36] [I] mean: 797.77 ms (end to end 797.77 ms)
[06/02/2021-15:00:36] [I] median: 788.8 ms (end to end 788.8 ms)
[06/02/2021-15:00:36] [I] percentile: 826.762 ms at 99% (end to end 826.762 ms at 99%)
[06/02/2021-15:00:36] [I] throughput: 0 qps
[06/02/2021-15:00:36] [I] walltime: 7.97772 s
[06/02/2021-15:00:36] [I] Enqueue Time
[06/02/2021-15:00:36] [I] min: 3.01349 ms
[06/02/2021-15:00:36] [I] max: 3.8457 ms
[06/02/2021-15:00:36] [I] median: 3.46265 ms
[06/02/2021-15:00:36] [I] GPU Compute
[06/02/2021-15:00:36] [I] min: 768.763 ms
[06/02/2021-15:00:36] [I] max: 826.762 ms
[06/02/2021-15:00:36] [I] mean: 797.77 ms
[06/02/2021-15:00:36] [I] median: 788.8 ms
[06/02/2021-15:00:36] [I] percentile: 826.762 ms at 99%
[06/02/2021-15:00:36] [I] total compute time: 7.9777 s
[06/02/2021-15:00:45] [I]
[06/02/2021-15:00:45] [I] === Profile (11 iterations ) ===
[06/02/2021-15:00:45] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/02/2021-15:00:45] [I] Conv_0 + Relu_1 193.32 17.5748 2.1
[06/02/2021-15:00:45] [I] Conv_2 + Relu_3 223.45 20.3133 2.5
[06/02/2021-15:00:45] [I] Conv_4 + Relu_5 705.83 64.1662 7.8
[06/02/2021-15:00:45] [I] Conv_6 + Relu_7 522.64 47.5131 5.8
[06/02/2021-15:00:45] [I] Conv_8 + Relu_9 706.56 64.2329 7.8
[06/02/2021-15:00:45] [I] Conv_10 + Add_11 + Relu_12 565.31 51.3922 6.3
[06/02/2021-15:00:45] [I] Conv_13 + Relu_14 706.37 64.2157 7.8
[06/02/2021-15:00:45] [I] Conv_15 + Relu_16 523.00 47.5453 5.8
[06/02/2021-15:00:45] [I] Conv_17 + Relu_18 707.66 64.3325 7.9
[06/02/2021-15:00:45] [I] Conv_19 + Add_20 + Relu_21 566.94 51.5397 6.3
[06/02/2021-15:00:45] [I] Conv_22 + Relu_23 363.44 33.0400 4.0
[06/02/2021-15:00:45] [I] Conv_24 + Relu_25 81.25 7.3860 0.9
[06/02/2021-15:00:45] [I] Conv_26 + Relu_27 333.62 30.3294 3.7
[06/02/2021-15:00:45] [I] Conv_28 74.46 6.7694 0.8
[06/02/2021-15:00:45] [I] Conv_29 + Add_30 + Relu_31 41.54 3.7762 0.5
[06/02/2021-15:00:45] [I] Conv_32 + Relu_33 489.90 44.5366 5.4
[06/02/2021-15:00:45] [I] Conv_34 + Relu_35 93.35 8.4865 1.0
[06/02/2021-15:00:45] [I] Conv_36 + Relu_37 487.77 44.3423 5.4
[06/02/2021-15:00:45] [I] Conv_38 + Add_39 + Relu_40 103.68 9.4258 1.2
[06/02/2021-15:00:45] [I] Conv_41 + Relu_42 264.03 24.0026 2.9
[06/02/2021-15:00:45] [I] Conv_43 + Relu_44 32.95 2.9950 0.4
[06/02/2021-15:00:45] [I] Conv_45 + Relu_46 135.13 12.2848 1.5
[06/02/2021-15:00:45] [I] Conv_47 50.09 4.5535 0.6
[06/02/2021-15:00:45] [I] Conv_48 + Add_49 + Relu_50 17.67 1.6059 0.2
[06/02/2021-15:00:45] [I] Conv_51 + Relu_52 197.70 17.9731 2.2
[06/02/2021-15:00:45] [I] Conv_53 + Relu_54 62.66 5.6960 0.7
[06/02/2021-15:00:45] [I] Conv_55 + Relu_56 197.99 17.9987 2.2
[06/02/2021-15:00:45] [I] Conv_57 + Add_58 + Relu_59 65.34 5.9404 0.7
[06/02/2021-15:00:45] [I] Conv_60 + Relu_61 48.82 4.4385 0.5
[06/02/2021-15:00:45] [I] Conv_62 + Relu_63 33.37 3.0339 0.4
[06/02/2021-15:00:45] [I] Conv_64 + Relu_65 86.51 7.8649 1.0
[06/02/2021-15:00:45] [I] Conv_66 32.86 2.9876 0.4
[06/02/2021-15:00:45] [I] Conv_67 + Add_68 + Relu_69 5.36 0.4872 0.1
[06/02/2021-15:00:45] [I] Conv_70 + Relu_71 100.95 9.1773 1.1
[06/02/2021-15:00:45] [I] Conv_72 + Relu_73 39.48 3.5889 0.4
[06/02/2021-15:00:45] [I] Conv_74 + Relu_75 101.06 9.1873 1.1
[06/02/2021-15:00:45] [I] Conv_76 + Add_77 + Relu_78 40.85 3.7139 0.5
[06/02/2021-15:00:45] [I] GlobalAveragePool_79 1.60 0.1451 0.0
[06/02/2021-15:00:45] [I] Gemm_81 0.17 0.0153 0.0
[06/02/2021-15:00:45] [I] Total 9004.69 818.6078 100.0
[06/02/2021-15:00:45] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
This is on a different machine than before, so for a fair comparison here are the results with --best on this same machine:
root@6b4242bad94e:/workspace# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/02/2021-14:51:33] [I] === Model Options ===
[06/02/2021-14:51:33] [I] Format: ONNX
[06/02/2021-14:51:33] [I] Model: resnet.onnx
[06/02/2021-14:51:33] [I] Output:
[06/02/2021-14:51:33] [I] === Build Options ===
[06/02/2021-14:51:33] [I] Max batch: explicit
[06/02/2021-14:51:33] [I] Workspace: 5000 MiB
[06/02/2021-14:51:33] [I] minTiming: 1
[06/02/2021-14:51:33] [I] avgTiming: 8
[06/02/2021-14:51:33] [I] Precision: FP32+FP16+INT8
[06/02/2021-14:51:33] [I] Calibration: Dynamic
[06/02/2021-14:51:33] [I] Refit: Disabled
[06/02/2021-14:51:33] [I] Safe mode: Disabled
[06/02/2021-14:51:33] [I] Save engine: resnet.trt
[06/02/2021-14:51:33] [I] Load engine:
[06/02/2021-14:51:33] [I] Builder Cache: Enabled
[06/02/2021-14:51:33] [I] NVTX verbosity: 0
[06/02/2021-14:51:33] [I] Tactic sources: Using default tactic sources
[06/02/2021-14:51:33] [I] Input(s): fp32:chw
[06/02/2021-14:51:33] [I] Output(s): fp32:chw
[06/02/2021-14:51:33] [I] Input build shapes: model
[06/02/2021-14:51:33] [I] Input calibration shapes: model
[06/02/2021-14:51:33] [I] === System Options ===
[06/02/2021-14:51:33] [I] Device: 0
[06/02/2021-14:51:33] [I] DLACore:
[06/02/2021-14:51:33] [I] Plugins:
[06/02/2021-14:51:33] [I] === Inference Options ===
[06/02/2021-14:51:33] [I] Batch: Explicit
[06/02/2021-14:51:33] [I] Input inference shapes: model
[06/02/2021-14:51:33] [I] Iterations: 10
[06/02/2021-14:51:33] [I] Duration: 3s (+ 200ms warm up)
[06/02/2021-14:51:33] [I] Sleep time: 0ms
[06/02/2021-14:51:33] [I] Streams: 1
[06/02/2021-14:51:33] [I] ExposeDMA: Disabled
[06/02/2021-14:51:33] [I] Data transfers: Disabled
[06/02/2021-14:51:33] [I] Spin-wait: Disabled
[06/02/2021-14:51:33] [I] Multithreading: Disabled
[06/02/2021-14:51:33] [I] CUDA Graph: Disabled
[06/02/2021-14:51:33] [I] Separate profiling: Enabled
[06/02/2021-14:51:33] [I] Skip inference: Disabled
[06/02/2021-14:51:33] [I] Inputs:
[06/02/2021-14:51:33] [I] === Reporting Options ===
[06/02/2021-14:51:33] [I] Verbose: Disabled
[06/02/2021-14:51:33] [I] Averages: 10 inferences
[06/02/2021-14:51:33] [I] Percentile: 99
[06/02/2021-14:51:33] [I] Dump refittable layers:Disabled
[06/02/2021-14:51:33] [I] Dump output: Disabled
[06/02/2021-14:51:33] [I] Profile: Enabled
[06/02/2021-14:51:33] [I] Export timing to JSON file:
[06/02/2021-14:51:33] [I] Export output to JSON file:
[06/02/2021-14:51:33] [I] Export profile to JSON file:
[06/02/2021-14:51:33] [I]
[06/02/2021-14:51:33] [I] === Device Information ===
[06/02/2021-14:51:33] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti
[06/02/2021-14:51:33] [I] Compute Capability: 6.1
[06/02/2021-14:51:33] [I] SMs: 28
[06/02/2021-14:51:33] [I] Compute Clock Rate: 1.62 GHz
[06/02/2021-14:51:33] [I] Device Global Memory: 11177 MiB
[06/02/2021-14:51:33] [I] Shared Memory per SM: 96 KiB
[06/02/2021-14:51:33] [I] Memory Bus Width: 352 bits (ECC disabled)
[06/02/2021-14:51:33] [I] Memory Clock Rate: 5.505 GHz
[06/02/2021-14:51:33] [I]
[06/02/2021-14:51:43] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-14:51:43] [I] [TRT] Input filename: resnet.onnx
[06/02/2021-14:51:43] [I] [TRT] ONNX IR version: 0.0.6
[06/02/2021-14:51:43] [I] [TRT] Opset version: 9
[06/02/2021-14:51:43] [I] [TRT] Producer name: pytorch
[06/02/2021-14:51:43] [I] [TRT] Producer version: 1.8
[06/02/2021-14:51:43] [I] [TRT] Domain:
[06/02/2021-14:51:43] [I] [TRT] Model version: 0
[06/02/2021-14:51:43] [I] [TRT] Doc string:
[06/02/2021-14:51:43] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-14:51:43] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[06/02/2021-14:51:43] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[06/02/2021-14:54:05] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/02/2021-14:54:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[06/02/2021-14:54:17] [I] Engine built in 164.476 sec.
[06/02/2021-14:54:17] [I] Starting inference
[06/02/2021-14:54:25] [I] Warmup completed 0 queries over 200 ms
[06/02/2021-14:54:25] [I] Timing trace has 0 queries over 6.48338 s
[06/02/2021-14:54:25] [I] Trace averages of 10 runs:
[06/02/2021-14:54:25] [I] Average on 10 runs - GPU latency: 648.337 ms - Host latency: 648.337 ms (end to end 648.337 ms, enqueue 5.02236 ms)
[06/02/2021-14:54:25] [I] Host Latency
[06/02/2021-14:54:25] [I] min: 635.064 ms (end to end 635.064 ms)
[06/02/2021-14:54:25] [I] max: 661.95 ms (end to end 661.95 ms)
[06/02/2021-14:54:25] [I] mean: 648.337 ms (end to end 648.337 ms)
[06/02/2021-14:54:25] [I] median: 651.916 ms (end to end 651.916 ms)
[06/02/2021-14:54:25] [I] percentile: 661.95 ms at 99% (end to end 661.95 ms at 99%)
[06/02/2021-14:54:25] [I] throughput: 0 qps
[06/02/2021-14:54:25] [I] walltime: 6.48338 s
[06/02/2021-14:54:25] [I] Enqueue Time
[06/02/2021-14:54:25] [I] min: 3.5438 ms
[06/02/2021-14:54:25] [I] max: 5.24646 ms
[06/02/2021-14:54:25] [I] median: 5.19568 ms
[06/02/2021-14:54:25] [I] GPU Compute
[06/02/2021-14:54:25] [I] min: 635.064 ms
[06/02/2021-14:54:25] [I] max: 661.95 ms
[06/02/2021-14:54:25] [I] mean: 648.337 ms
[06/02/2021-14:54:25] [I] median: 651.916 ms
[06/02/2021-14:54:25] [I] percentile: 661.95 ms at 99%
[06/02/2021-14:54:25] [I] total compute time: 6.48337 s
[06/02/2021-14:54:32] [I]
[06/02/2021-14:54:32] [I] === Profile (11 iterations ) ===
[06/02/2021-14:54:32] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/02/2021-14:54:32] [I] Conv_0 + Relu_1 input reformatter 0 6.78 0.6165 0.1
[06/02/2021-14:54:32] [I] Conv_0 + Relu_1 181.18 16.4711 2.5
[06/02/2021-14:54:32] [I] Conv_2 + Relu_3 149.82 13.6198 2.1
[06/02/2021-14:54:32] [I] Conv_4 + Relu_5 649.60 59.0542 8.9
[06/02/2021-14:54:32] [I] Conv_6 + Relu_7 232.50 21.1359 3.2
[06/02/2021-14:54:32] [I] Conv_8 + Relu_9 651.02 59.1834 9.0
[06/02/2021-14:54:32] [I] Conv_10 + Add_11 + Relu_12 256.36 23.3053 3.5
[06/02/2021-14:54:32] [I] Conv_13 + Relu_14 653.66 59.4238 9.0
[06/02/2021-14:54:32] [I] Conv_15 + Relu_16 233.98 21.2712 3.2
[06/02/2021-14:54:32] [I] Conv_17 + Relu_18 655.17 59.5611 9.0
[06/02/2021-14:54:32] [I] Conv_19 + Add_20 + Relu_21 257.12 23.3748 3.5
[06/02/2021-14:54:32] [I] Conv_19 + Add_20 + Relu_21 output reformatter 0 31.11 2.8283 0.4
[06/02/2021-14:54:32] [I] Conv_22 + Relu_23 353.59 32.1443 4.9
[06/02/2021-14:54:32] [I] Conv_24 + Relu_25 79.96 7.2688 1.1
[06/02/2021-14:54:32] [I] Conv_26 + Relu_27 323.08 29.3709 4.4
[06/02/2021-14:54:32] [I] Conv_28 72.87 6.6244 1.0
[06/02/2021-14:54:32] [I] Conv_29 + Add_30 + Relu_31 40.84 3.7125 0.6
[06/02/2021-14:54:32] [I] Conv_32 + Relu_33 472.22 42.9289 6.5
[06/02/2021-14:54:32] [I] Conv_34 + Relu_35 91.85 8.3500 1.3
[06/02/2021-14:54:32] [I] Conv_36 + Relu_37 471.57 42.8702 6.5
[06/02/2021-14:54:32] [I] Conv_38 + Add_39 + Relu_40 101.94 9.2671 1.4
[06/02/2021-14:54:32] [I] Conv_38 + Add_39 + Relu_40 output reformatter 0 8.77 0.7973 0.1
[06/02/2021-14:54:32] [I] Conv_41 + Relu_42 165.75 15.0680 2.3
[06/02/2021-14:54:32] [I] Conv_43 + Relu_44 30.03 2.7297 0.4
[06/02/2021-14:54:32] [I] Conv_45 + Relu_46 134.64 12.2402 1.9
[06/02/2021-14:54:32] [I] Conv_47 39.64 3.6035 0.5
[06/02/2021-14:54:32] [I] Conv_48 + Add_49 + Relu_50 9.98 0.9072 0.1
[06/02/2021-14:54:32] [I] Conv_51 + Relu_52 196.39 17.8537 2.7
[06/02/2021-14:54:32] [I] Conv_53 + Relu_54 48.94 4.4493 0.7
[06/02/2021-14:54:32] [I] Conv_55 + Relu_56 195.75 17.7952 2.7
[06/02/2021-14:54:32] [I] Conv_57 + Add_58 + Relu_59 50.34 4.5768 0.7
[06/02/2021-14:54:32] [I] Conv_60 + Relu_61 input reformatter 0 2.03 0.1850 0.0
[06/02/2021-14:54:32] [I] Conv_60 + Relu_61 46.81 4.2552 0.6
[06/02/2021-14:54:32] [I] Conv_62 + Relu_63 input reformatter 0 1.95 0.1775 0.0
[06/02/2021-14:54:32] [I] Conv_62 + Relu_63 19.64 1.7850 0.3
[06/02/2021-14:54:32] [I] Conv_64 + Relu_65 input reformatter 0 0.61 0.0557 0.0
[06/02/2021-14:54:32] [I] Conv_64 + Relu_65 83.39 7.5812 1.1
[06/02/2021-14:54:32] [I] Conv_66 input reformatter 0 1.12 0.1021 0.0
[06/02/2021-14:54:32] [I] Conv_66 19.26 1.7512 0.3
[06/02/2021-14:54:32] [I] Conv_67 + Add_68 + Relu_69 3.94 0.3586 0.1
[06/02/2021-14:54:32] [I] Conv_70 + Relu_71 input reformatter 0 0.61 0.0552 0.0
[06/02/2021-14:54:32] [I] Conv_70 + Relu_71 97.39 8.8534 1.3
[06/02/2021-14:54:32] [I] Conv_72 + Relu_73 input reformatter 0 1.38 0.1253 0.0
[06/02/2021-14:54:32] [I] Conv_72 + Relu_73 23.65 2.1500 0.3
[06/02/2021-14:54:32] [I] Conv_74 + Relu_75 input reformatter 0 0.61 0.0556 0.0
[06/02/2021-14:54:32] [I] Conv_74 + Relu_75 97.34 8.8488 1.3
[06/02/2021-14:54:32] [I] Conv_76 + Add_77 + Relu_78 input reformatter 0 1.38 0.1250 0.0
[06/02/2021-14:54:32] [I] Conv_76 + Add_77 + Relu_78 24.07 2.1878 0.3
[06/02/2021-14:54:32] [I] GlobalAveragePool_79 1.15 0.1047 0.0
[06/02/2021-14:54:32] [I] Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0 0.06 0.0056 0.0
[06/02/2021-14:54:32] [I] Gemm_81 0.17 0.0152 0.0
[06/02/2021-14:54:32] [I] Total 7273.00 661.1818 100.0
[06/02/2021-14:54:32] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
Here are the results with TensorRT 8. As you can see, it's still slow :/
I installed TensorRT manually from nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.0.3-ea-20210423_1-1_amd64.deb
on the docker image nvcr.io/nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04.
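In case it helps reproduce the setup, the install can be sketched as a Dockerfile using the base image and .deb file named above (the repo directory and key filename follow the usual TensorRT local-repo layout and are assumptions, not copied from my shell history):

```dockerfile
# Sketch of the environment used for the TensorRT 8 EA numbers above.
FROM nvcr.io/nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# The .deb is copied into the build context beforehand.
COPY nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.0.3-ea-20210423_1-1_amd64.deb /tmp/

# Register the local apt repo shipped in the .deb, trust its key,
# then install the TensorRT meta-package (pulls in trtexec under /usr/src/tensorrt/bin).
RUN dpkg -i /tmp/nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.0.3-ea-20210423_1-1_amd64.deb \
    && apt-key add /var/nv-tensorrt-repo-*/ *.pub 2>/dev/null || apt-key add /var/nv-tensorrt-repo-*/*.pub \
    && apt-get update \
    && apt-get install -y tensorrt
```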
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:23:23] [I] === Model Options ===
[06/23/2021-13:23:23] [I] Format: ONNX
[06/23/2021-13:23:23] [I] Model: resnet.onnx
[06/23/2021-13:23:23] [I] Output:
[06/23/2021-13:23:23] [I] === Build Options ===
[06/23/2021-13:23:23] [I] Max batch: explicit
[06/23/2021-13:23:23] [I] Workspace: 5000 MiB
[06/23/2021-13:23:23] [I] minTiming: 1
[06/23/2021-13:23:23] [I] avgTiming: 8
[06/23/2021-13:23:23] [I] Precision: FP32+FP16+INT8
[06/23/2021-13:23:23] [I] Calibration: Dynamic
[06/23/2021-13:23:23] [I] Refit: Disabled
[06/23/2021-13:23:23] [I] Sparsity: Disabled
[06/23/2021-13:23:23] [I] Safe mode: Disabled
[06/23/2021-13:23:23] [I] Enable serialization: Disabled
[06/23/2021-13:23:23] [I] Save engine: resnet.trt
[06/23/2021-13:23:23] [I] Load engine:
[06/23/2021-13:23:23] [I] NVTX verbosity: 0
[06/23/2021-13:23:23] [I] Tactic sources: Using default tactic sources
[06/23/2021-13:23:23] [I] timingCacheMode: local
[06/23/2021-13:23:23] [I] timingCacheFile:
[06/23/2021-13:23:23] [I] Input(s): fp32:chw
[06/23/2021-13:23:23] [I] Output(s): fp32:chw
[06/23/2021-13:23:23] [I] Input build shapes: model
[06/23/2021-13:23:23] [I] Input calibration shapes: model
[06/23/2021-13:23:23] [I] === System Options ===
[06/23/2021-13:23:23] [I] Device: 0
[06/23/2021-13:23:23] [I] DLACore:
[06/23/2021-13:23:23] [I] Plugins:
[06/23/2021-13:23:23] [I] === Inference Options ===
[06/23/2021-13:23:23] [I] Batch: Explicit
[06/23/2021-13:23:23] [I] Input inference shapes: model
[06/23/2021-13:23:23] [I] Iterations: 10
[06/23/2021-13:23:23] [I] Duration: 3s (+ 200ms warm up)
[06/23/2021-13:23:23] [I] Sleep time: 0ms
[06/23/2021-13:23:23] [I] Streams: 1
[06/23/2021-13:23:23] [I] ExposeDMA: Disabled
[06/23/2021-13:23:23] [I] Data transfers: Disabled
[06/23/2021-13:23:23] [I] Spin-wait: Disabled
[06/23/2021-13:23:23] [I] Multithreading: Disabled
[06/23/2021-13:23:23] [I] CUDA Graph: Disabled
[06/23/2021-13:23:23] [I] Separate profiling: Enabled
[06/23/2021-13:23:23] [I] Time Deserialize: Disabled
[06/23/2021-13:23:23] [I] Time Refit: Disabled
[06/23/2021-13:23:23] [I] Skip inference: Disabled
[06/23/2021-13:23:23] [I] Inputs:
[06/23/2021-13:23:23] [I] === Reporting Options ===
[06/23/2021-13:23:23] [I] Verbose: Disabled
[06/23/2021-13:23:23] [I] Averages: 10 inferences
[06/23/2021-13:23:23] [I] Percentile: 99
[06/23/2021-13:23:23] [I] Dump refittable layers:Disabled
[06/23/2021-13:23:23] [I] Dump output: Disabled
[06/23/2021-13:23:23] [I] Profile: Enabled
[06/23/2021-13:23:23] [I] Export timing to JSON file:
[06/23/2021-13:23:23] [I] Export output to JSON file:
[06/23/2021-13:23:23] [I] Export profile to JSON file:
[06/23/2021-13:23:23] [I]
[06/23/2021-13:23:23] [I] === Device Information ===
[06/23/2021-13:23:23] [I] Selected Device: GeForce GTX 1080 Ti
[06/23/2021-13:23:23] [I] Compute Capability: 6.1
[06/23/2021-13:23:23] [I] SMs: 28
[06/23/2021-13:23:23] [I] Compute Clock Rate: 1.582 GHz
[06/23/2021-13:23:23] [I] Device Global Memory: 11176 MiB
[06/23/2021-13:23:23] [I] Shared Memory per SM: 96 KiB
[06/23/2021-13:23:23] [I] Memory Bus Width: 352 bits (ECC disabled)
[06/23/2021-13:23:23] [I] Memory Clock Rate: 5.505 GHz
[06/23/2021-13:23:23] [I]
[06/23/2021-13:23:23] [I] TensorRT version: 8000
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +159, GPU +0, now: CPU 165, GPU 215 (MiB)
[06/23/2021-13:23:24] [I] [TRT] ----------------------------------------------------------------
[06/23/2021-13:23:24] [I] [TRT] Input filename: resnet.onnx
[06/23/2021-13:23:24] [I] [TRT] ONNX IR version: 0.0.6
[06/23/2021-13:23:24] [I] [TRT] Opset version: 9
[06/23/2021-13:23:24] [I] [TRT] Producer name: pytorch
[06/23/2021-13:23:24] [I] [TRT] Producer version: 1.7
[06/23/2021-13:23:24] [I] [TRT] Domain:
[06/23/2021-13:23:24] [I] [TRT] Model version: 0
[06/23/2021-13:23:24] [I] [TRT] Doc string:
[06/23/2021-13:23:24] [I] [TRT] ----------------------------------------------------------------
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 245, GPU 215 (MiB)
[06/23/2021-13:23:24] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 245 MiB, GPU 215 MiB
[06/23/2021-13:23:24] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[06/23/2021-13:23:24] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[06/23/2021-13:23:24] [W] [TRT] Convolution + generic activation fusion is disable due to incompatible driver or nvrtc
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +233, GPU +94, now: CPU 479, GPU 309 (MiB)
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +189, GPU +84, now: CPU 668, GPU 393 (MiB)
[06/23/2021-13:23:24] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[06/23/2021-13:24:35] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/23/2021-13:24:40] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[06/23/2021-13:24:40] [I] [TRT] Total Host Persistent Memory: 1536
[06/23/2021-13:24:40] [I] [TRT] Total Device Persistent Memory: 0
[06/23/2021-13:24:40] [I] [TRT] Total Scratch Memory: 508851200
[06/23/2021-13:24:40] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 4 MiB
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 909, GPU 533 (MiB)
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 909, GPU 541 (MiB)
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 909, GPU 525 (MiB)
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 908, GPU 507 (MiB)
[06/23/2021-13:24:40] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 908 MiB, GPU 507 MiB
[06/23/2021-13:24:41] [I] Engine built in 77.2374 sec.
[06/23/2021-13:24:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 827, GPU 517 (MiB)
[06/23/2021-13:24:41] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 827, GPU 525 (MiB)
[06/23/2021-13:24:41] [I] Created input binding for 0 with dimensions 4x3x35x224x224
[06/23/2021-13:24:41] [I] Created output binding for 343 with dimensions 4x400
[06/23/2021-13:24:41] [I] Starting inference
[06/23/2021-13:24:48] [I] Warmup completed 1 queries over 200 ms
[06/23/2021-13:24:48] [I] Timing trace has 10 queries over 6.36198 s
[06/23/2021-13:24:48] [I]
[06/23/2021-13:24:48] [I] === Trace details ===
[06/23/2021-13:24:48] [I] Trace averages of 10 runs:
[06/23/2021-13:24:48] [I] Average on 10 runs - GPU latency: 636.196 ms - Host latency: 636.196 ms (end to end 636.196 ms, enqueue 2.68777 ms)
[06/23/2021-13:24:48] [I]
[06/23/2021-13:24:48] [I] === Performance summary ===
[06/23/2021-13:24:48] [I] Throughput: 1.57184 qps
[06/23/2021-13:24:48] [I] Latency: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms
[06/23/2021-13:24:48] [I] End-to-End Host Latency: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms
[06/23/2021-13:24:48] [I] Enqueue Time: min = 2.53569 ms, max = 2.81494 ms, mean = 2.68777 ms, median = 2.74072 ms, percentile(99%) = 2.81494 ms
[06/23/2021-13:24:48] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[06/23/2021-13:24:48] [I] GPU Compute Time: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms
[06/23/2021-13:24:48] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[06/23/2021-13:24:48] [I] Total Host Walltime: 6.36198 s
[06/23/2021-13:24:48] [I] Total GPU Compute Time: 6.36196 s
[06/23/2021-13:24:48] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/23/2021-13:24:48] [I]
[06/23/2021-13:24:55] [I]
[06/23/2021-13:24:55] [I] === Profile (11 iterations ) ===
[06/23/2021-13:24:55] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1 7.29 0.6629 0.1
[06/23/2021-13:24:55] [I] Conv_0 + Relu_1 168.48 15.3162 2.4
[06/23/2021-13:24:55] [I] Conv_2 + Relu_3 142.70 12.9725 2.0
[06/23/2021-13:24:55] [I] Conv_4 + Relu_5 618.20 56.1999 8.8
[06/23/2021-13:24:55] [I] Conv_6 + Relu_7 230.13 20.9210 3.3
[06/23/2021-13:24:55] [I] Conv_8 + Relu_9 639.12 58.1021 9.1
[06/23/2021-13:24:55] [I] Conv_10 + Add_11 + Relu_12 251.36 22.8513 3.6
[06/23/2021-13:24:55] [I] Conv_13 + Relu_14 632.13 57.4666 9.0
[06/23/2021-13:24:55] [I] Conv_15 + Relu_16 227.75 20.7047 3.3
[06/23/2021-13:24:55] [I] Conv_17 + Relu_18 632.38 57.4888 9.0
[06/23/2021-13:24:55] [I] Conv_19 + Add_20 + Relu_21 250.15 22.7405 3.6
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Output Tensor 0 to Conv_19 + Add_20 + Relu_21 29.86 2.7147 0.4
[06/23/2021-13:24:55] [I] Conv_22 + Relu_23 340.15 30.9225 4.9
[06/23/2021-13:24:55] [I] Conv_24 + Relu_25 78.56 7.1415 1.1
[06/23/2021-13:24:55] [I] Conv_26 + Relu_27 308.86 28.0779 4.4
[06/23/2021-13:24:55] [I] Conv_28 71.43 6.4938 1.0
[06/23/2021-13:24:55] [I] Conv_29 + Add_30 + Relu_31 39.99 3.6358 0.6
[06/23/2021-13:24:55] [I] Conv_32 + Relu_33 450.92 40.9929 6.4
[06/23/2021-13:24:55] [I] Conv_34 + Relu_35 89.53 8.1391 1.3
[06/23/2021-13:24:55] [I] Conv_36 + Relu_37 447.44 40.6764 6.4
[06/23/2021-13:24:55] [I] Conv_38 + Add_39 + Relu_40 98.98 8.9981 1.4
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Output Tensor 0 to Conv_38 + Add_39 + Relu_40 9.82 0.8929 0.1
[06/23/2021-13:24:55] [I] Conv_41 + Relu_42 156.82 14.2564 2.2
[06/23/2021-13:24:55] [I] Conv_43 + Relu_44 28.42 2.5833 0.4
[06/23/2021-13:24:55] [I] Conv_45 + Relu_46 127.12 11.5566 1.8
[06/23/2021-13:24:55] [I] Conv_47 38.15 3.4682 0.5
[06/23/2021-13:24:55] [I] Conv_48 + Add_49 + Relu_50 9.71 0.8830 0.1
[06/23/2021-13:24:55] [I] Conv_51 + Relu_52 185.15 16.8321 2.6
[06/23/2021-13:24:55] [I] Conv_53 + Relu_54 46.87 4.2614 0.7
[06/23/2021-13:24:55] [I] Conv_55 + Relu_56 183.94 16.7219 2.6
[06/23/2021-13:24:55] [I] Conv_57 + Add_58 + Relu_59 48.34 4.3944 0.7
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Input Tensor 0 to Conv_60 + Relu_61 1.92 0.1742 0.0
[06/23/2021-13:24:55] [I] Conv_60 + Relu_61 44.42 4.0381 0.6
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Input Tensor 0 to Conv_62 + Relu_63 2.42 0.2203 0.0
[06/23/2021-13:24:55] [I] Conv_62 + Relu_63 18.49 1.6812 0.3
[06/23/2021-13:24:55] [I] Conv_64 + Relu_65 80.67 7.3335 1.2
[06/23/2021-13:24:55] [I] Conv_66 18.08 1.6433 0.3
[06/23/2021-13:24:55] [I] Conv_67 + Add_68 + Relu_69 3.78 0.3439 0.1
[06/23/2021-13:24:55] [I] Conv_70 + Relu_71 93.57 8.5066 1.3
[06/23/2021-13:24:55] [I] Conv_72 + Relu_73 22.16 2.0149 0.3
[06/23/2021-13:24:55] [I] Conv_74 + Relu_75 93.51 8.5008 1.3
[06/23/2021-13:24:55] [I] Conv_76 + Add_77 + Relu_78 22.57 2.0520 0.3
[06/23/2021-13:24:55] [I] GlobalAveragePool_79 1.09 0.0995 0.0
[06/23/2021-13:24:55] [I] Reformatting CopyNode for Input Tensor 0 to Flatten_80 + (Unnamed Layer* 81) [Shuffle] 0.07 0.0061 0.0
[06/23/2021-13:24:55] [I] Gemm_81 0.17 0.0154 0.0
[06/23/2021-13:24:55] [I] Total 6992.69 635.6991 100.0
[06/23/2021-13:24:55] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:24:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 827, GPU 1919 (MiB)
And without --best:
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:25:38] [I] === Model Options ===
[06/23/2021-13:25:38] [I] Format: ONNX
[06/23/2021-13:25:38] [I] Model: resnet.onnx
[06/23/2021-13:25:38] [I] Output:
[06/23/2021-13:25:38] [I] === Build Options ===
[06/23/2021-13:25:38] [I] Max batch: explicit
[06/23/2021-13:25:38] [I] Workspace: 5000 MiB
[06/23/2021-13:25:38] [I] minTiming: 1
[06/23/2021-13:25:38] [I] avgTiming: 8
[06/23/2021-13:25:38] [I] Precision: FP32
[06/23/2021-13:25:38] [I] Calibration:
[06/23/2021-13:25:38] [I] Refit: Disabled
[06/23/2021-13:25:38] [I] Sparsity: Disabled
[06/23/2021-13:25:38] [I] Safe mode: Disabled
[06/23/2021-13:25:38] [I] Enable serialization: Disabled
[06/23/2021-13:25:38] [I] Save engine: resnet.trt
[06/23/2021-13:25:38] [I] Load engine:
[06/23/2021-13:25:38] [I] NVTX verbosity: 0
[06/23/2021-13:25:38] [I] Tactic sources: Using default tactic sources
[06/23/2021-13:25:38] [I] timingCacheMode: local
[06/23/2021-13:25:38] [I] timingCacheFile:
[06/23/2021-13:25:38] [I] Input(s): fp32:chw
[06/23/2021-13:25:38] [I] Output(s): fp32:chw
[06/23/2021-13:25:38] [I] Input build shapes: model
[06/23/2021-13:25:38] [I] Input calibration shapes: model
[06/23/2021-13:25:38] [I] === System Options ===
[06/23/2021-13:25:38] [I] Device: 0
[06/23/2021-13:25:38] [I] DLACore:
[06/23/2021-13:25:38] [I] Plugins:
[06/23/2021-13:25:38] [I] === Inference Options ===
[06/23/2021-13:25:38] [I] Batch: Explicit
[06/23/2021-13:25:38] [I] Input inference shapes: model
[06/23/2021-13:25:38] [I] Iterations: 10
[06/23/2021-13:25:38] [I] Duration: 3s (+ 200ms warm up)
[06/23/2021-13:25:38] [I] Sleep time: 0ms
[06/23/2021-13:25:38] [I] Streams: 1
[06/23/2021-13:25:38] [I] ExposeDMA: Disabled
[06/23/2021-13:25:38] [I] Data transfers: Disabled
[06/23/2021-13:25:38] [I] Spin-wait: Disabled
[06/23/2021-13:25:38] [I] Multithreading: Disabled
[06/23/2021-13:25:38] [I] CUDA Graph: Disabled
[06/23/2021-13:25:38] [I] Separate profiling: Enabled
[06/23/2021-13:25:38] [I] Time Deserialize: Disabled
[06/23/2021-13:25:38] [I] Time Refit: Disabled
[06/23/2021-13:25:38] [I] Skip inference: Disabled
[06/23/2021-13:25:38] [I] Inputs:
[06/23/2021-13:25:38] [I] === Reporting Options ===
[06/23/2021-13:25:38] [I] Verbose: Disabled
[06/23/2021-13:25:38] [I] Averages: 10 inferences
[06/23/2021-13:25:38] [I] Percentile: 99
[06/23/2021-13:25:38] [I] Dump refittable layers:Disabled
[06/23/2021-13:25:38] [I] Dump output: Disabled
[06/23/2021-13:25:38] [I] Profile: Enabled
[06/23/2021-13:25:38] [I] Export timing to JSON file:
[06/23/2021-13:25:38] [I] Export output to JSON file:
[06/23/2021-13:25:38] [I] Export profile to JSON file:
[06/23/2021-13:25:38] [I]
[06/23/2021-13:25:38] [I] === Device Information ===
[06/23/2021-13:25:38] [I] Selected Device: GeForce GTX 1080 Ti
[06/23/2021-13:25:38] [I] Compute Capability: 6.1
[06/23/2021-13:25:38] [I] SMs: 28
[06/23/2021-13:25:38] [I] Compute Clock Rate: 1.582 GHz
[06/23/2021-13:25:38] [I] Device Global Memory: 11176 MiB
[06/23/2021-13:25:38] [I] Shared Memory per SM: 96 KiB
[06/23/2021-13:25:38] [I] Memory Bus Width: 352 bits (ECC disabled)
[06/23/2021-13:25:38] [I] Memory Clock Rate: 5.505 GHz
[06/23/2021-13:25:38] [I]
[06/23/2021-13:25:38] [I] TensorRT version: 8000
[06/23/2021-13:25:38] [I] [TRT] [MemUsageChange] Init CUDA: CPU +159, GPU +0, now: CPU 165, GPU 215 (MiB)
[06/23/2021-13:25:38] [I] [TRT] ----------------------------------------------------------------
[06/23/2021-13:25:38] [I] [TRT] Input filename: resnet.onnx
[06/23/2021-13:25:38] [I] [TRT] ONNX IR version: 0.0.6
[06/23/2021-13:25:38] [I] [TRT] Opset version: 9
[06/23/2021-13:25:38] [I] [TRT] Producer name: pytorch
[06/23/2021-13:25:38] [I] [TRT] Producer version: 1.7
[06/23/2021-13:25:38] [I] [TRT] Domain:
[06/23/2021-13:25:38] [I] [TRT] Model version: 0
[06/23/2021-13:25:38] [I] [TRT] Doc string:
[06/23/2021-13:25:38] [I] [TRT] ----------------------------------------------------------------
[06/23/2021-13:25:38] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 245, GPU 215 (MiB)
[06/23/2021-13:25:38] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 245 MiB, GPU 215 MiB
[06/23/2021-13:25:38] [W] [TRT] Convolution + generic activation fusion is disable due to incompatible driver or nvrtc
[06/23/2021-13:25:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +234, GPU +94, now: CPU 479, GPU 309 (MiB)
[06/23/2021-13:25:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +188, GPU +84, now: CPU 667, GPU 393 (MiB)
[06/23/2021-13:25:39] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[06/23/2021-13:26:14] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/23/2021-13:26:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[06/23/2021-13:26:16] [I] [TRT] Total Host Persistent Memory: 1536
[06/23/2021-13:26:16] [I] [TRT] Total Device Persistent Memory: 0
[06/23/2021-13:26:16] [I] [TRT] Total Scratch Memory: 1014681600
[06/23/2021-13:26:16] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 855, GPU 587 (MiB)
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 855, GPU 595 (MiB)
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 855, GPU 579 (MiB)
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 855, GPU 561 (MiB)
[06/23/2021-13:26:16] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 855 MiB, GPU 561 MiB
[06/23/2021-13:26:17] [I] Engine built in 38.7099 sec.
[06/23/2021-13:26:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 774, GPU 571 (MiB)
[06/23/2021-13:26:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 775, GPU 579 (MiB)
[06/23/2021-13:26:17] [I] Created input binding for 0 with dimensions 4x3x35x224x224
[06/23/2021-13:26:17] [I] Created output binding for 343 with dimensions 4x400
[06/23/2021-13:26:17] [I] Starting inference
[06/23/2021-13:26:25] [I] Warmup completed 1 queries over 200 ms
[06/23/2021-13:26:25] [I] Timing trace has 10 queries over 7.64667 s
[06/23/2021-13:26:25] [I]
[06/23/2021-13:26:25] [I] === Trace details ===
[06/23/2021-13:26:25] [I] Trace averages of 10 runs:
[06/23/2021-13:26:25] [I] Average on 10 runs - GPU latency: 764.665 ms - Host latency: 764.665 ms (end to end 764.665 ms, enqueue 1.9593 ms)
[06/23/2021-13:26:25] [I]
[06/23/2021-13:26:25] [I] === Performance summary ===
[06/23/2021-13:26:25] [I] Throughput: 1.30776 qps
[06/23/2021-13:26:25] [I] Latency: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms
[06/23/2021-13:26:25] [I] End-to-End Host Latency: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms
[06/23/2021-13:26:25] [I] Enqueue Time: min = 1.78426 ms, max = 2.03662 ms, mean = 1.9593 ms, median = 1.98743 ms, percentile(99%) = 2.03662 ms
[06/23/2021-13:26:25] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[06/23/2021-13:26:25] [I] GPU Compute Time: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms
[06/23/2021-13:26:25] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[06/23/2021-13:26:25] [I] Total Host Walltime: 7.64667 s
[06/23/2021-13:26:25] [I] Total GPU Compute Time: 7.64665 s
[06/23/2021-13:26:25] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/23/2021-13:26:25] [I]
[06/23/2021-13:26:34] [I]
[06/23/2021-13:26:34] [I] === Profile (11 iterations ) ===
[06/23/2021-13:26:34] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/23/2021-13:26:34] [I] Conv_0 + Relu_1 179.34 16.3033 2.1
[06/23/2021-13:26:34] [I] Conv_2 + Relu_3 208.35 18.9407 2.5
[06/23/2021-13:26:34] [I] Conv_4 + Relu_5 662.98 60.2711 7.9
[06/23/2021-13:26:34] [I] Conv_6 + Relu_7 496.22 45.1105 5.9
[06/23/2021-13:26:34] [I] Conv_8 + Relu_9 668.58 60.7802 7.9
[06/23/2021-13:26:34] [I] Conv_10 + Add_11 + Relu_12 533.84 48.5308 6.3
[06/23/2021-13:26:34] [I] Conv_13 + Relu_14 663.27 60.2971 7.9
[06/23/2021-13:26:34] [I] Conv_15 + Relu_16 494.74 44.9765 5.9
[06/23/2021-13:26:34] [I] Conv_17 + Relu_18 667.25 60.6595 7.9
[06/23/2021-13:26:34] [I] Conv_19 + Add_20 + Relu_21 531.30 48.2997 6.3
[06/23/2021-13:26:34] [I] Conv_22 + Relu_23 337.54 30.6856 4.0
[06/23/2021-13:26:34] [I] Conv_24 + Relu_25 78.31 7.1188 0.9
[06/23/2021-13:26:34] [I] Conv_26 + Relu_27 305.24 27.7490 3.6
[06/23/2021-13:26:34] [I] Conv_28 71.01 6.4558 0.8
[06/23/2021-13:26:34] [I] Conv_29 + Add_30 + Relu_31 39.64 3.6039 0.5
[06/23/2021-13:26:34] [I] Conv_32 + Relu_33 446.94 40.6308 5.3
[06/23/2021-13:26:34] [I] Conv_34 + Relu_35 88.57 8.0518 1.1
[06/23/2021-13:26:34] [I] Conv_36 + Relu_37 443.04 40.2766 5.3
[06/23/2021-13:26:34] [I] Conv_38 + Add_39 + Relu_40 99.77 9.0698 1.2
[06/23/2021-13:26:34] [I] Conv_41 + Relu_42 239.33 21.7572 2.8
[06/23/2021-13:26:34] [I] Conv_43 + Relu_44 30.69 2.7897 0.4
[06/23/2021-13:26:34] [I] Conv_45 + Relu_46 128.52 11.6840 1.5
[06/23/2021-13:26:34] [I] Conv_47 46.40 4.2186 0.6
[06/23/2021-13:26:34] [I] Conv_48 + Add_49 + Relu_50 17.43 1.5845 0.2
[06/23/2021-13:26:34] [I] Conv_51 + Relu_52 187.69 17.0626 2.2
[06/23/2021-13:26:34] [I] Conv_53 + Relu_54 59.29 5.3900 0.7
[06/23/2021-13:26:34] [I] Conv_55 + Relu_56 188.16 17.1051 2.2
[06/23/2021-13:26:34] [I] Conv_57 + Add_58 + Relu_59 62.02 5.6380 0.7
[06/23/2021-13:26:34] [I] Conv_60 + Relu_61 45.43 4.1304 0.5
[06/23/2021-13:26:34] [I] Conv_62 + Relu_63 30.29 2.7533 0.4
[06/23/2021-13:26:34] [I] Conv_64 + Relu_65 80.24 7.2947 1.0
[06/23/2021-13:26:34] [I] Conv_66 30.02 2.7289 0.4
[06/23/2021-13:26:34] [I] Conv_67 + Add_68 + Relu_69 5.18 0.4705 0.1
[06/23/2021-13:26:34] [I] Conv_70 + Relu_71 93.26 8.4783 1.1
[06/23/2021-13:26:34] [I] Conv_72 + Relu_73 36.31 3.3012 0.4
[06/23/2021-13:26:34] [I] Conv_74 + Relu_75 93.49 8.4995 1.1
[06/23/2021-13:26:34] [I] Conv_76 + Add_77 + Relu_78 37.64 3.4216 0.4
[06/23/2021-13:26:34] [I] GlobalAveragePool_79 1.58 0.1435 0.0
[06/23/2021-13:26:34] [I] Gemm_81 0.18 0.0160 0.0
[06/23/2021-13:26:34] [I] Total 8429.07 766.2787 100.0
[06/23/2021-13:26:34] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:26:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 774, GPU 3463 (MiB)
Will that help you in figuring out this performance regression?
I wonder if this is related to this. It's the same GPU, and the performance drop is comparable.
If it is, it would mean cuDNN 8.2.1 solves the issue; see https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-821, the line: "Known regressions on certain layers in cuDNN 8 regression in algorithm selection heuristics have been fixed on Volta and Pascal platforms."
Since it's in the 21.06 image just released, I'll take a look.
@hyperfraise thanks for sharing and sorry for the delay response, I have created internal issue to track this regression.
I think I was right. I tested 21.06, which ships cuDNN 8.2.1, and the problem seems solved:
21.06 : 535.982 ms
20.11 : 563.797 ms
(so there's even a small speedup)
Thus closing this.
Description
Going from 20.11 to 20.12 introduces a performance regression on a common 3D convolution model.
Environment
TensorRT Version: 7.2.1 -> 7.2.2
NVIDIA GPU: 1080 Ti
NVIDIA Driver Version: 460
CUDA Version: 11.1.0 -> 11.1.1
CUDNN Version: 8.0.4 -> 8.0.5
Operating System: Ubuntu 20
Python Version (if applicable): 3.6 -> 3.8
PyTorch Version (if applicable): 1.8.1
Steps To Reproduce
To reproduce, save a 3D model in ONNX format with this script:
Then optimize it in the TensorRT Docker container.
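For reference, the exact trtexec invocation used in the logs above (run inside the NGC TensorRT container; the container tag is an example):

```shell
# Inside the NGC TensorRT container, e.g. nvcr.io/nvidia/tensorrt:20.11-py3
/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 \
    --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw \
    --noDataTransfers --dumpProfile --separateProfileRun
```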
My speed test shows these results (speeds are in videos/s).
This ~10% regression persists in later versions.