NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Performance regression on 1080 Ti #1221

Closed dmenig closed 3 years ago

dmenig commented 3 years ago

Description

Going from the 20.11 to the 20.12 TensorRT container introduces a performance regression on a common 3D convolution model.

Environment

TensorRT Version: 7.2.1 -> 7.2.2
NVIDIA GPU: 1080 Ti
NVIDIA Driver Version: 460
CUDA Version: 11.1.0 -> 11.1.1
CUDNN Version: 8.0.4 -> 8.0.5
Operating System: Ubuntu 20
Python Version (if applicable): 3.6 -> 3.8
PyTorch Version (if applicable): 1.8.1

Steps To Reproduce

To reproduce, export a 3D model to ONNX format with this script:

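import torch
import torchvision

# Batch of 4 clips, 3 channels, 35 frames of 224x224 pixels each.
dummy_input = torch.randn(4, 3, 35, 224, 224).float().cuda()
# R(2+1)D-18 video model from torchvision, in inference mode on the GPU.
model = torchvision.models.video.r2plus1d_18().cuda().eval()

# Trace the model with the dummy input and write it to resnet.onnx.
with torch.no_grad():
    torch.onnx.export(
        model,
        dummy_input,
        "resnet.onnx",
        verbose=True,
    )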

Then optimize it in the TensorRT Docker container:

/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw

My speed test shows these results (speeds are in videos/s):

20.11 
1080 Ti : 7.52

20.12
1080 Ti : 6.80

This ~10% regression continues with later versions.
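
For reference, the ~10% figure can be checked from the two speeds above. This is only a minimal sketch; the helper name `relative_change` is made up for illustration and is not part of any tool used here.

```python
def relative_change(before, after):
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0

speed_20_11 = 7.52  # videos/s on the 1080 Ti with the 20.11 container
speed_20_12 = 6.80  # videos/s on the 1080 Ti with the 20.12 container

change = relative_change(speed_20_11, speed_20_12)
print(f"{change:.1f}%")  # -9.6%
```

A negative value means the newer container is slower.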

dmenig commented 3 years ago

On those models, I see more of a -28% regression in speed.

ttyio commented 3 years ago

Hello @hyperfraise, could you provide the perf numbers again using trtexec with the options --noDataTransfers --dumpProfile --separateProfiling, and attach both logs here? Thanks!

dmenig commented 3 years ago

I think you meant --separateProfileRun. I added those three options and here are the logs:

20.11 :

root@6ec22ad980d8:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun                                                                                                                                                                                
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun                                                                                                                                                                            
[04/30/2021-08:51:33] [I] === Model Options ===                                                                                                                                                            
[04/30/2021-08:51:33] [I] Format: ONNX                                                                                                                                                                     
[04/30/2021-08:51:33] [I] Model: resnet.onnx                                                                                                                                                               
[04/30/2021-08:51:33] [I] Output:                                                                                                                                                                          
[04/30/2021-08:51:33] [I] === Build Options ===                                                                                                                                                            
[04/30/2021-08:51:33] [I] Max batch: explicit                                                                                                                                                              
[04/30/2021-08:51:33] [I] Workspace: 5000 MiB                                                                                                                                                              
[04/30/2021-08:51:33] [I] minTiming: 1                                                                                                                                                                     
[04/30/2021-08:51:33] [I] avgTiming: 8                                                                                                                                                                     
[04/30/2021-08:51:33] [I] Precision: FP32+FP16+INT8                                                                                                                                                        
[04/30/2021-08:51:33] [I] Calibration: Dynamic                                                                                                                                                             
[04/30/2021-08:51:33] [I] Refit: Disabled                                                                                                                                                                  
[04/30/2021-08:51:33] [I] Safe mode: Disabled                                                                                                                                                              
[04/30/2021-08:51:33] [I] Save engine: resnet.trt                                                                                                                                                          
[04/30/2021-08:51:33] [I] Load engine:                                                                                                                                                                     
[04/30/2021-08:51:33] [I] Builder Cache: Enabled                                                                                                                                                           
[04/30/2021-08:51:33] [I] NVTX verbosity: 0                                                                                                                                                                
[04/30/2021-08:51:33] [I] Tactic sources: Using default tactic sources                                                                                                                                     
[04/30/2021-08:51:33] [I] Input(s): fp32:chw                                                                                                                                                               
[04/30/2021-08:51:33] [I] Output(s): fp32:chw                                                                                                                                                              
[04/30/2021-08:51:33] [I] Input build shapes: model                                                                                                                                                        
[04/30/2021-08:51:33] [I] Input calibration shapes: model                                                                                                                                                  
[04/30/2021-08:51:33] [I] === System Options ===                                                                                                                                                           
[04/30/2021-08:51:33] [I] Device: 0                                                                                                                                                                        
[04/30/2021-08:51:33] [I] DLACore:                                                                                                                                                                         
[04/30/2021-08:51:33] [I] Plugins:                                                                                                                                                                         
[04/30/2021-08:51:33] [I] === Inference Options ===                                                                                                                                                        
[04/30/2021-08:51:33] [I] Batch: Explicit                                                                                                                                                                  
[04/30/2021-08:51:33] [I] Input inference shapes: model                                                                                                                                                    
[04/30/2021-08:51:33] [I] Iterations: 10                                                                                                                                                                   
[04/30/2021-08:51:33] [I] Duration: 3s (+ 200ms warm up)                                                                                                                                                   
[04/30/2021-08:51:33] [I] Sleep time: 0ms                                                                                                                                                                  
[04/30/2021-08:51:33] [I] Streams: 1                                                                                                                                                                       
[04/30/2021-08:51:33] [I] ExposeDMA: Disabled                                                                                                                                                              
[04/30/2021-08:51:33] [I] Data transfers: Disabled                                                                                                                                                         
[04/30/2021-08:51:33] [I] Spin-wait: Disabled                                                                                                                                                              
[04/30/2021-08:51:33] [I] Multithreading: Disabled                                                                                                                                                         
[04/30/2021-08:51:33] [I] CUDA Graph: Disabled                                                                                                                                                             
[04/30/2021-08:51:33] [I] Separate profiling: Enabled                                                                                                                                                      
[04/30/2021-08:51:33] [I] Skip inference: Disabled                                                                                                                                                         
[04/30/2021-08:51:33] [I] Inputs:                                                                                                                                                                          
[04/30/2021-08:51:33] [I] === Reporting Options ===                                                                                                                                                        
[04/30/2021-08:51:33] [I] Verbose: Disabled                                                                                                                                                                
[04/30/2021-08:51:33] [I] Averages: 10 inferences                                                                                                                                                          
[04/30/2021-08:51:33] [I] Percentile: 99                                                                                                                                                                   
[04/30/2021-08:51:33] [I] Dump refittable layers:Disabled                                                                                                                                                  
[04/30/2021-08:51:33] [I] Dump output: Disabled                                                                                                                                                            
[04/30/2021-08:51:33] [I] Profile: Enabled                                                                                                                                                                 
[04/30/2021-08:51:33] [I] Export timing to JSON file:                                                                                                                                                      
[04/30/2021-08:51:33] [I] Export output to JSON file:                                                                                                                                                      
[04/30/2021-08:51:33] [I] Export profile to JSON file:                                                                                                                                                     
[04/30/2021-08:51:33] [I]                                                                                                                                                                                  
[04/30/2021-08:51:33] [I] === Device Information ===                                                                                                                                                       
[04/30/2021-08:51:33] [I] Selected Device: GeForce GTX 1080 Ti                                                                                                                                             
[04/30/2021-08:51:33] [I] Compute Capability: 6.1                                                                                                                                                          
[04/30/2021-08:51:33] [I] SMs: 28                                                                                                                                                                          
[04/30/2021-08:51:33] [I] Compute Clock Rate: 1.6325 GHz                                                                                                                                                   
[04/30/2021-08:51:33] [I] Device Global Memory: 11178 MiB                                                                                                                                                  
[04/30/2021-08:51:33] [I] Shared Memory per SM: 96 KiB                                                                                                                                                     
[04/30/2021-08:51:33] [I] Memory Bus Width: 352 bits (ECC disabled)                                                                                                                                        
[04/30/2021-08:51:33] [I] Memory Clock Rate: 5.505 GHz                                                                                                                                                     
[04/30/2021-08:51:33] [I]                                                                                                                                                                                  
----------------------------------------------------------------                                                                                                                                           
Input filename:   resnet.onnx                                                                                                                                                                              
ONNX IR version:  0.0.6                                                                                                        
Opset version:    9                                                                                                            
Producer name:    pytorch                                                                                                      
Producer version: 1.8                                                                                                          
Domain:                                                                                                                        
Model version:    0                                                                                                            
Doc string:                                                                                                                    
----------------------------------------------------------------                                                               
[04/30/2021-08:51:46] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[04/30/2021-08:51:46] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[04/30/2021-08:54:01] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/30/2021-08:54:11] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                
[04/30/2021-08:54:12] [I] Engine built in 158.76 sec.                                                                          
[04/30/2021-08:54:12] [I] Starting inference                                                                                   
[04/30/2021-08:54:18] [I] Warmup completed 0 queries over 200 ms                                                               
[04/30/2021-08:54:18] [I] Timing trace has 0 queries over 5.67147 s                                                            
[04/30/2021-08:54:18] [I] Trace averages of 10 runs:                                                                           
[04/30/2021-08:54:18] [I] Average on 10 runs - GPU latency: 567.144 ms - Host latency: 567.144 ms (end to end 567.144 ms, enqueue 5.0075 ms)
[04/30/2021-08:54:18] [I] Host Latency                                                                                         
[04/30/2021-08:54:18] [I] min: 564.432 ms (end to end 564.432 ms)                                                              
[04/30/2021-08:54:18] [I] max: 569.332 ms (end to end 569.332 ms)                                                              
[04/30/2021-08:54:18] [I] mean: 567.144 ms (end to end 567.144 ms)                                                             
[04/30/2021-08:54:18] [I] median: 567.073 ms (end to end 567.073 ms)                                                           
[04/30/2021-08:54:18] [I] percentile: 569.332 ms at 99% (end to end 569.332 ms at 99%)                                         
[04/30/2021-08:54:18] [I] throughput: 0 qps                                                                                    
[04/30/2021-08:54:18] [I] walltime: 5.67147 s                                                                                  
[04/30/2021-08:54:18] [I] Enqueue Time                                                                                         
[04/30/2021-08:54:18] [I] min: 1.9541 ms                                                                                       
[04/30/2021-08:54:18] [I] max: 8.28448 ms                                                                                      
[04/30/2021-08:54:18] [I] median: 4.74475 ms                                                                                   
[04/30/2021-08:54:18] [I] GPU Compute                                                                                          
[04/30/2021-08:54:18] [I] min: 564.432 ms                                                                                      
[04/30/2021-08:54:18] [I] max: 569.332 ms                                                                                      
[04/30/2021-08:54:18] [I] mean: 567.144 ms                                                                                     
[04/30/2021-08:54:18] [I] median: 567.073 ms                                                                                   
[04/30/2021-08:54:18] [I] percentile: 569.332 ms at 99%                                                                        
[04/30/2021-08:54:18] [I] total compute time: 5.67144 s                                                                        
[04/30/2021-08:54:25] [I]                                                                                                      
[04/30/2021-08:54:25] [I] === Profile (11 iterations ) ===
[04/30/2021-08:54:25] [I]                                                           Layer   Time (ms)   Avg. Time (ms)   Time %                                                                            
[04/30/2021-08:54:25] [I]                             Conv_0 + Relu_1 input reformatter 0        6.97           0.6335      0.1
[04/30/2021-08:54:25] [I]                                                 Conv_0 + Relu_1      173.17          15.7430      2.8
[04/30/2021-08:54:25] [I]                                                 Conv_2 + Relu_3      169.07          15.3696      2.7
[04/30/2021-08:54:25] [I]                                                 Conv_4 + Relu_5      623.77          56.7059      9.9
[04/30/2021-08:54:25] [I]                                                 Conv_6 + Relu_7      272.34          24.7580      4.3
[04/30/2021-08:54:25] [I]                                                 Conv_8 + Relu_9      633.04          57.5493     10.1
[04/30/2021-08:54:25] [I]                                      Conv_10 + Add_11 + Relu_12      295.63          26.8753      4.7
[04/30/2021-08:54:25] [I]                                               Conv_13 + Relu_14      631.06          57.3693     10.1
[04/30/2021-08:54:25] [I]                                               Conv_15 + Relu_16      271.87          24.7155      4.3
[04/30/2021-08:54:25] [I]                                               Conv_17 + Relu_18      627.31          57.0278     10.0
[04/30/2021-08:54:25] [I]                                      Conv_19 + Add_20 + Relu_21      294.50          26.7729      4.7
[04/30/2021-08:54:25] [I]                                               Conv_22 + Relu_23      259.98          23.6343      4.1
[04/30/2021-08:54:25] [I]                                               Conv_24 + Relu_25       68.20           6.2004      1.1
[04/30/2021-08:54:25] [I]                           Conv_26 + Relu_27 input reformatter 0        7.67           0.6973      0.1
[04/30/2021-08:54:25] [I]                                               Conv_26 + Relu_27      199.37          18.1245      3.2
[04/30/2021-08:54:25] [I]                                     Conv_28 input reformatter 0       15.76           1.4331      0.3
[04/30/2021-08:54:25] [I]                                                         Conv_28       63.83           5.8026      1.0
[04/30/2021-08:54:25] [I]                                      Conv_29 + Add_30 + Relu_31       30.52           2.7749      0.5
[04/30/2021-08:54:25] [I]                                               Conv_32 + Relu_33      253.58          23.0530      4.0
[04/30/2021-08:54:25] [I]                                               Conv_34 + Relu_35       78.28           7.1166      1.2
[04/30/2021-08:54:25] [I]                                               Conv_36 + Relu_37      253.48          23.0439      4.0
[04/30/2021-08:54:25] [I]                                      Conv_38 + Add_39 + Relu_40       83.95           7.6318      1.3
[04/30/2021-08:54:25] [I]                                               Conv_41 + Relu_42      146.97          13.3608      2.3
[04/30/2021-08:54:25] [I]                                               Conv_43 + Relu_44       28.73           2.6118      0.5
[04/30/2021-08:54:25] [I]                           Conv_45 + Relu_46 input reformatter 0        1.96           0.1781      0.0
[04/30/2021-08:54:25] [I]                                               Conv_45 + Relu_46       89.20           8.1094      1.4
[04/30/2021-08:54:25] [I]                                     Conv_47 input reformatter 0        3.83           0.3485      0.1
[04/30/2021-08:54:25] [I]                                                         Conv_47       30.85           2.8043      0.5
[04/30/2021-08:54:25] [I]                                      Conv_48 + Add_49 + Relu_50        9.89           0.8994      0.2
[04/30/2021-08:54:25] [I]                           Conv_51 + Relu_52 input reformatter 0        1.97           0.1787      0.0
[04/30/2021-08:54:25] [I]                                               Conv_51 + Relu_52      101.72           9.2470      1.6
[04/30/2021-08:54:25] [I]                                               Conv_53 + Relu_54       41.94           3.8130      0.7
[04/30/2021-08:54:25] [I]                                               Conv_55 + Relu_56      101.44           9.2214      1.6
[04/30/2021-08:54:25] [I]                  Conv_57 + Add_58 + Relu_59 input reformatter 0        4.80           0.4360      0.1
[04/30/2021-08:54:25] [I]                                      Conv_57 + Add_58 + Relu_59       39.64           3.6038      0.6
[04/30/2021-08:54:25] [I]                           Conv_60 + Relu_61 input reformatter 0        1.95           0.1777      0.0
[04/30/2021-08:54:25] [I]                                               Conv_60 + Relu_61      122.45          11.1317      2.0
[04/30/2021-08:54:25] [I]                                               Conv_62 + Relu_63       17.26           1.5688      0.3
[04/30/2021-08:54:25] [I]                           Conv_64 + Relu_65 input reformatter 0        0.65           0.0587      0.0
[04/30/2021-08:54:25] [I]                                               Conv_64 + Relu_65       49.33           4.4849      0.8
[04/30/2021-08:54:25] [I]                                                         Conv_66       15.96           1.4506      0.3
[04/30/2021-08:54:25] [I]                                      Conv_67 + Add_68 + Relu_69        3.93           0.3571      0.1
[04/30/2021-08:54:25] [I]                                               Conv_70 + Relu_71       52.28           4.7532      0.8
[04/30/2021-08:54:25] [I]                                               Conv_72 + Relu_73       19.74           1.7943      0.3
[04/30/2021-08:54:25] [I]                                               Conv_74 + Relu_75       52.42           4.7657      0.8
[04/30/2021-08:54:25] [I]                                      Conv_76 + Add_77 + Relu_78       20.24           1.8400      0.3
[04/30/2021-08:54:25] [I]                                            GlobalAveragePool_79        1.13           0.1028      0.0
[04/30/2021-08:54:25] [I]  Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0        0.09           0.0079      0.0
[04/30/2021-08:54:25] [I]                                                         Gemm_81        0.18           0.0163      0.0
[04/30/2021-08:54:25] [I]                                                           Total     6273.90         570.3544    100.0
[04/30/2021-08:54:25] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
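
To compare two such logs layer by layer, the profile table can be parsed with a small script. The following is a rough sketch that assumes the row layout shown in the log above (timestamp, `[I]`, layer name, total ms, average ms, percentage); the names `ROW`, `parse_profile` and `slowest` are hypothetical helpers, not part of trtexec, and the sample rows are copied from the 20.11 profile.

```python
import re

# One row of the "=== Profile ===" table:
# "[timestamp] [I] <layer name> <total ms> <avg ms> <percent>"
ROW = re.compile(
    r"^\[[^\]]+\] \[I\]\s+(?P<layer>.+?)\s+"
    r"(?P<total>\d+(?:\.\d+)?)\s+(?P<avg>\d+(?:\.\d+)?)\s+(?P<pct>\d+(?:\.\d+)?)\s*$"
)

def parse_profile(log_text):
    """Return {layer name: average time in ms}, skipping the Total row."""
    layers = {}
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m and m.group("layer") != "Total":
            layers[m.group("layer")] = float(m.group("avg"))
    return layers

def slowest(layers, n=5):
    """Return the n layers with the highest average time, slowest first."""
    return sorted(layers.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample = """\
[04/30/2021-08:54:25] [I]   Layer   Time (ms)   Avg. Time (ms)   Time %
[04/30/2021-08:54:25] [I]   Conv_0 + Relu_1 input reformatter 0        6.97           0.6335      0.1
[04/30/2021-08:54:25] [I]   Conv_4 + Relu_5      623.77          56.7059      9.9
[04/30/2021-08:54:25] [I]   Conv_8 + Relu_9      633.04          57.5493     10.1
[04/30/2021-08:54:25] [I]   Total     6273.90         570.3544    100.0
"""

profile = parse_profile(sample)
print(slowest(profile, n=2))
```

Running `parse_profile` on each full log and subtracting the per-layer averages would show which layers account for the difference between the two containers.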

20.12 :

root@10fb4bdae972:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun                          
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[04/30/2021-08:36:35] [I] === Model Options ===     
[04/30/2021-08:36:35] [I] Format: ONNX                        
[04/30/2021-08:36:35] [I] Model: resnet.onnx     
[04/30/2021-08:36:35] [I] Output:
[04/30/2021-08:36:35] [I] === Build Options ===         
[04/30/2021-08:36:35] [I] Max batch: explicit            
[04/30/2021-08:36:35] [I] Workspace: 5000 MiB         
[04/30/2021-08:36:35] [I] minTiming: 1                             
[04/30/2021-08:36:35] [I] avgTiming: 8                
[04/30/2021-08:36:35] [I] Precision: FP32+FP16+INT8
[04/30/2021-08:36:35] [I] Calibration: Dynamic                  
[04/30/2021-08:36:35] [I] Refit: Disabled
[04/30/2021-08:36:35] [I] Safe mode: Disabled
[04/30/2021-08:36:35] [I] Save engine: resnet.trt
[04/30/2021-08:36:35] [I] Load engine:
[04/30/2021-08:36:35] [I] Builder Cache: Enabled
[04/30/2021-08:36:35] [I] NVTX verbosity: 0
[04/30/2021-08:36:35] [I] Tactic sources: Using default tactic sources
[04/30/2021-08:36:35] [I] Input(s): fp32:chw
[04/30/2021-08:36:35] [I] Output(s): fp32:chw
[04/30/2021-08:36:35] [I] Input build shapes: model
[04/30/2021-08:36:35] [I] Input calibration shapes: model
[04/30/2021-08:36:35] [I] === System Options ===
[04/30/2021-08:36:35] [I] Device: 0
[04/30/2021-08:36:35] [I] DLACore:
[04/30/2021-08:36:35] [I] Plugins:
[04/30/2021-08:36:35] [I] === Inference Options ===
[04/30/2021-08:36:35] [I] Batch: Explicit
[04/30/2021-08:36:35] [I] Input inference shapes: model
[04/30/2021-08:36:35] [I] Iterations: 10
[04/30/2021-08:36:35] [I] Duration: 3s (+ 200ms warm up)
[04/30/2021-08:36:35] [I] Sleep time: 0ms
[04/30/2021-08:36:35] [I] Streams: 1
[04/30/2021-08:36:35] [I] ExposeDMA: Disabled
[04/30/2021-08:36:35] [I] Data transfers: Disabled
[04/30/2021-08:36:35] [I] Spin-wait: Disabled
[04/30/2021-08:36:35] [I] Multithreading: Disabled
[04/30/2021-08:36:35] [I] CUDA Graph: Disabled
[04/30/2021-08:36:35] [I] Separate profiling: Enabled
[04/30/2021-08:36:35] [I] Skip inference: Disabled
[04/30/2021-08:36:35] [I] Inputs:
[04/30/2021-08:36:35] [I] === Reporting Options ===
[04/30/2021-08:36:35] [I] Verbose: Disabled
[04/30/2021-08:36:35] [I] Averages: 10 inferences
[04/30/2021-08:36:35] [I] Percentile: 99
[04/30/2021-08:36:35] [I] Dump refittable layers:Disabled
[04/30/2021-08:36:35] [I] Dump output: Disabled
[04/30/2021-08:36:35] [I] Profile: Enabled
[04/30/2021-08:36:35] [I] Export timing to JSON file:
[04/30/2021-08:36:35] [I] Export output to JSON file:
[04/30/2021-08:36:35] [I] Export profile to JSON file:
[04/30/2021-08:36:35] [I]
[04/30/2021-08:36:35] [I] === Device Information ===
[04/30/2021-08:36:35] [I] Selected Device: GeForce GTX 1080 Ti
[04/30/2021-08:36:35] [I] Compute Capability: 6.1
[04/30/2021-08:36:35] [I] SMs: 28
[04/30/2021-08:36:35] [I] Compute Clock Rate: 1.6325 GHz
[04/30/2021-08:36:35] [I] Device Global Memory: 11178 MiB
[04/30/2021-08:36:35] [I] Shared Memory per SM: 96 KiB
[04/30/2021-08:36:35] [I] Memory Bus Width: 352 bits (ECC disabled)
[04/30/2021-08:36:35] [I] Memory Clock Rate: 5.505 GHz
[04/30/2021-08:36:35] [I]
----------------------------------------------------------------
Input filename:   resnet.onnx
ONNX IR version:  0.0.6
Opset version:    9
Producer name:    pytorch
Producer version: 1.8
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[04/30/2021-08:36:49] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[04/30/2021-08:36:49] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[04/30/2021-08:39:06] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/30/2021-08:39:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                
[04/30/2021-08:39:17] [I] Engine built in 161.698 sec.                                                                         
[04/30/2021-08:39:17] [I] Starting inference                                                                                   
[04/30/2021-08:39:24] [I] Warmup completed 0 queries over 200 ms                                                               
[04/30/2021-08:39:24] [I] Timing trace has 0 queries over 6.02128 s                                                            
[04/30/2021-08:39:24] [I] Trace averages of 10 runs:                                                                           
[04/30/2021-08:39:24] [I] Average on 10 runs - GPU latency: 602.126 ms - Host latency: 602.126 ms (end to end 602.126 ms, enqueue 3.9255 ms)
[04/30/2021-08:39:24] [I] Host Latency                                                                                         
[04/30/2021-08:39:24] [I] min: 599.799 ms (end to end 599.799 ms)                                                              
[04/30/2021-08:39:24] [I] max: 603.952 ms (end to end 603.952 ms)                                                              
[04/30/2021-08:39:24] [I] mean: 602.126 ms (end to end 602.126 ms)                                                             
[04/30/2021-08:39:24] [I] median: 602.148 ms (end to end 602.148 ms)                                                           
[04/30/2021-08:39:24] [I] percentile: 603.952 ms at 99% (end to end 603.952 ms at 99%)                                         
[04/30/2021-08:39:24] [I] throughput: 0 qps                                                                                    
[04/30/2021-08:39:24] [I] walltime: 6.02128 s                                                                                  
[04/30/2021-08:39:24] [I] Enqueue Time                                                                                         
[04/30/2021-08:39:24] [I] min: 3.33887 ms                                                                                      
[04/30/2021-08:39:24] [I] max: 4.47217 ms                                                                                      
[04/30/2021-08:39:24] [I] median: 3.82956 ms                                                                                   
[04/30/2021-08:39:24] [I] GPU Compute                                                                                          
[04/30/2021-08:39:24] [I] min: 599.799 ms                                                                                      
[04/30/2021-08:39:24] [I] max: 603.952 ms                                                                                      
[04/30/2021-08:39:24] [I] mean: 602.126 ms                                                                                     
[04/30/2021-08:39:24] [I] median: 602.148 ms
[04/30/2021-08:39:24] [I] percentile: 603.952 ms at 99%                                                                                                                                                    
[04/30/2021-08:39:24] [I] total compute time: 6.02126 s
[04/30/2021-08:39:30] [I]  
[04/30/2021-08:39:30] [I] === Profile (11 iterations ) ===
[04/30/2021-08:39:30] [I]                                                           Layer   Time (ms)   Avg. Time (ms)   Time %
[04/30/2021-08:39:30] [I]                             Conv_0 + Relu_1 input reformatter 0        6.80           0.6186      0.1
[04/30/2021-08:39:30] [I]                                                 Conv_0 + Relu_1      161.23          14.6573      2.4
[04/30/2021-08:39:30] [I]                                                 Conv_2 + Relu_3      140.68          12.7886      2.1
[04/30/2021-08:39:30] [I]                                                 Conv_4 + Relu_5      593.23          53.9304      8.9
[04/30/2021-08:39:30] [I]                                                 Conv_6 + Relu_7      221.12          20.1014      3.3
[04/30/2021-08:39:30] [I]                                                 Conv_8 + Relu_9      616.47          56.0428      9.3
[04/30/2021-08:39:30] [I]                                      Conv_10 + Add_11 + Relu_12      245.15          22.2864      3.7
[04/30/2021-08:39:30] [I]                                               Conv_13 + Relu_14      611.87          55.6248      9.2
[04/30/2021-08:39:30] [I]                                               Conv_15 + Relu_16      220.98          20.0887      3.3
[04/30/2021-08:39:30] [I]                                               Conv_17 + Relu_18      608.63          55.3298      9.2
[04/30/2021-08:39:30] [I]                                      Conv_19 + Add_20 + Relu_21      244.15          22.1955      3.7
[04/30/2021-08:39:30] [I]                 Conv_19 + Add_20 + Relu_21 output reformatter 0       28.64           2.6038      0.4
[04/30/2021-08:39:30] [I]                                               Conv_22 + Relu_23      308.68          28.0622      4.7
[04/30/2021-08:39:30] [I]                                               Conv_24 + Relu_25       77.74           7.0669      1.2
[04/30/2021-08:39:30] [I]                                               Conv_26 + Relu_27      280.21          25.4732      4.2
[04/30/2021-08:39:30] [I]                                                         Conv_28       70.68           6.4256      1.1
[04/30/2021-08:39:30] [I]                                      Conv_29 + Add_30 + Relu_31       38.21           3.4738      0.6
[04/30/2021-08:39:30] [I]                                               Conv_32 + Relu_33      409.39          37.2171      6.2
[04/30/2021-08:39:30] [I]                                               Conv_34 + Relu_35       88.38           8.0343      1.3
[04/30/2021-08:39:30] [I]                                               Conv_36 + Relu_37      408.98          37.1797      6.2
[04/30/2021-08:39:30] [I]                                      Conv_38 + Add_39 + Relu_40       98.79           8.9807      1.5
[04/30/2021-08:39:30] [I]                 Conv_38 + Add_39 + Relu_40 output reformatter 0        8.49           0.7723      0.1
[04/30/2021-08:39:30] [I]                                               Conv_41 + Relu_42      140.08          12.7344      2.1
[04/30/2021-08:39:30] [I]                                               Conv_43 + Relu_44       27.54           2.5037      0.4
[04/30/2021-08:39:30] [I]                                               Conv_45 + Relu_46      120.52          10.9566      1.8
[04/30/2021-08:39:30] [I]                                                         Conv_47       37.47           3.4063      0.6
[04/30/2021-08:39:30] [I]                                      Conv_48 + Add_49 + Relu_50        9.55           0.8679      0.1
[04/30/2021-08:39:30] [I]                                               Conv_51 + Relu_52      175.88          15.9887      2.7
[04/30/2021-08:39:30] [I]                                               Conv_53 + Relu_54       46.05           4.1864      0.7
[04/30/2021-08:39:30] [I]                                               Conv_55 + Relu_56      175.45          15.9497      2.6
[04/30/2021-08:39:30] [I]                                      Conv_57 + Add_58 + Relu_59       47.57           4.3244      0.7
[04/30/2021-08:39:30] [I]                           Conv_60 + Relu_61 input reformatter 0        1.87           0.1696      0.0
[04/30/2021-08:39:30] [I]                                               Conv_60 + Relu_61       43.41           3.9463      0.7
[04/30/2021-08:39:30] [I]                           Conv_62 + Relu_63 input reformatter 0        1.90           0.1725      0.0
[04/30/2021-08:39:30] [I]                                               Conv_62 + Relu_63       18.08           1.6440      0.3
[04/30/2021-08:39:30] [I]                                               Conv_64 + Relu_65       77.87           7.0794      1.2
[04/30/2021-08:39:30] [I]                                                         Conv_66       17.61           1.6013      0.3
[04/30/2021-08:39:30] [I]                                      Conv_67 + Add_68 + Relu_69        3.73           0.3395      0.1
[04/30/2021-08:39:30] [I]                                               Conv_70 + Relu_71       79.39           7.2176      1.2
[04/30/2021-08:39:30] [I]                                               Conv_72 + Relu_73       21.48           1.9528      0.3
[04/30/2021-08:39:30] [I]                                               Conv_74 + Relu_75       78.93           7.1758      1.2
[04/30/2021-08:39:30] [I]                                      Conv_76 + Add_77 + Relu_78       21.82           1.9834      0.3
[04/30/2021-08:39:30] [I]                                            GlobalAveragePool_79        1.09           0.0989      0.0
[04/30/2021-08:39:30] [I]  Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0        0.08           0.0069      0.0
[04/30/2021-08:39:30] [I]                                                         Gemm_81        0.17           0.0153      0.0
[04/30/2021-08:39:30] [I]                                                           Total     6636.03         603.2751    100.0
[04/30/2021-08:39:30] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
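
For anyone comparing these per-layer profiles between containers, the table is easier to work with once parsed. Below is a minimal sketch, assuming the row layout shown in the log above (timestamp, `[I]`, layer name, then three decimal columns). The names `LayerTiming` and `parse_profile` are made up for illustration and are not part of trtexec or TensorRT.

```python
import re
from typing import List, NamedTuple


class LayerTiming(NamedTuple):
    name: str
    total_ms: float   # "Time (ms)" column, summed over the profiled iterations
    avg_ms: float     # "Avg. Time (ms)" column
    percent: float    # "Time %" column


# "[timestamp] [I] <layer name> <time> <avg time> <time %>". The three numeric
# columns always carry a decimal point, which keeps a trailing "0" in names
# such as "... input reformatter 0" inside the layer name.
_ROW = re.compile(
    r"^\[[^\]]*\] \[I\]\s+(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$"
)


def parse_profile(log_text: str) -> List[LayerTiming]:
    """Return one LayerTiming per profiled layer, skipping the Total row."""
    rows = []
    for line in log_text.splitlines():
        match = _ROW.match(line)
        if match and match.group(1) != "Total":
            name, total, avg, pct = match.groups()
            rows.append(LayerTiming(name, float(total), float(avg), float(pct)))
    return rows


# Rows copied from the profile above (column padding shortened).
SAMPLE = """\
[04/30/2021-08:39:30] [I]    Layer   Time (ms)   Avg. Time (ms)   Time %
[04/30/2021-08:39:30] [I]    Conv_0 + Relu_1 input reformatter 0        6.80           0.6186      0.1
[04/30/2021-08:39:30] [I]    Conv_4 + Relu_5      593.23          53.9304      8.9
[04/30/2021-08:39:30] [I]    Conv_8 + Relu_9      616.47          56.0428      9.3
[04/30/2021-08:39:30] [I]    Total     6636.03         603.2751    100.0
"""

layers = parse_profile(SAMPLE)
slowest = max(layers, key=lambda layer: layer.avg_ms)
print(slowest.name, slowest.avg_ms)
```

Parsing two logs this way and joining them on the layer name would show which convolutions account for the difference between builds.
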
ttyio commented 3 years ago

Hello @hyperfraise, comparing the GPU compute medians,

  [04/30/2021-08:54:18] [I] median: 567.073 ms

with

  [04/30/2021-08:39:24] [I] median: 602.148 ms

gives (602.148 - 567.073) / 567.073 = 6.1%, so I do not see a 28% regression. Could you double-check? Thanks.
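
The comparison above can be written as a small helper. This is only a sketch of the arithmetic. The function name `relative_slowdown` is hypothetical, and the two inputs are the GPU compute medians quoted in this comment.

```python
def relative_slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """Fractional latency increase of the candidate relative to the baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (candidate_ms - baseline_ms) / baseline_ms


# GPU compute medians from the two trtexec runs quoted above.
slowdown = relative_slowdown(baseline_ms=567.073, candidate_ms=602.148)
print(f"{slowdown:.1%}")
```

With these inputs the ratio is about 0.0619, which prints as 6.2% when rounded. The 6.1% quoted above is the same figure truncated.
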

dmenig commented 3 years ago

I was talking about another model for the 28% regression. For this model I observed 7.5 spl/s compared to 6.8, which is a ~10% regression. I don't think the stats from the dump are perfectly representative of the real mean inference speed. 6% is still a regression anyway (enough to make me think the problem is the same as the one I observe with the other model, which has a 28% regression)!
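
Part of the gap between the two percentages comes from measuring different things: a drop in throughput and an increase in latency are not the same ratio. The sketch below shows the conversion. It assumes one trtexec query processes one batch of 4 videos (the batch dimension of the dummy input in the export script) and takes 7.52 and 6.80 videos/s from the issue description. All function names are hypothetical.

```python
def throughput_drop(old_rate: float, new_rate: float) -> float:
    """Fractional loss in samples per second."""
    return (old_rate - new_rate) / old_rate


def latency_increase_to_throughput_drop(latency_increase: float) -> float:
    """For a fixed batch size, throughput is proportional to 1 / latency,
    so a latency increase of x is a throughput drop of x / (1 + x)."""
    return latency_increase / (1.0 + latency_increase)


def samples_per_second(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by a per-batch latency."""
    return batch_size * 1000.0 / latency_ms


# Measured throughput from the issue description (videos / s).
measured_drop = throughput_drop(7.52, 6.80)

# trtexec GPU compute medians (ms per batch), assuming a batch of 4 videos.
latency_increase = (602.148 - 567.073) / 567.073
implied_drop = latency_increase_to_throughput_drop(latency_increase)
implied_old = samples_per_second(4, 567.073)
implied_new = samples_per_second(4, 602.148)

print(f"measured throughput drop: {measured_drop:.1%}")
print(f"drop implied by trtexec medians: {implied_drop:.1%}")
print(f"implied rates: {implied_old:.2f} -> {implied_new:.2f} videos/s")
```

Under those assumptions, the trtexec medians imply a smaller throughput drop than the one measured with the separate speed test, so the two numbers are not directly interchangeable.
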

dmenig commented 3 years ago

So @ttyio, do you think you guys might be able to do something to solve this?

ttyio commented 3 years ago

Hello @hyperfraise, I can reproduce the 6% regression on a Pascal device, but given our limited development bandwidth, I'm sorry to say it is not in the top-priority queue. Could you try the latest 8.0 EA release? Thanks.

dmenig commented 3 years ago

Sure, I'll just wait for the nvcr release of TensorRT 8.0 in a new docker image. I hope you guys don't feel overwhelmed and manage to code stress free.

dmenig commented 3 years ago

Oh. FYI on 21.05 (which is TensorRT 7.2.3-1), results are even worse :/
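
To put a number on "even worse", the GPU compute median can be pulled out of each trtexec log and compared. This is a rough sketch that assumes the summary layout seen in the logs in this thread (a `GPU Compute` heading followed by `min`/`max`/`mean`/`median` lines). `gpu_compute_median_ms` is a hypothetical helper, and the excerpts are copied from the 04/30 log earlier in the thread and the 21.05 log that follows.

```python
import re
from typing import Optional


def gpu_compute_median_ms(log_text: str) -> Optional[float]:
    """Return the median listed under the "GPU Compute" heading, if present."""
    in_section = False
    for line in log_text.splitlines():
        # Drop the "[timestamp] [I] " prefix that trtexec puts on each line.
        body = re.sub(r"^\[[^\]]*\] \[\w\]\s*", "", line).strip()
        if body == "GPU Compute":
            in_section = True
        elif in_section:
            match = re.match(r"median:\s*([\d.]+)\s*ms", body)
            if match:
                return float(match.group(1))
    return None


RUN_0430 = """\
[04/30/2021-08:39:24] [I] Enqueue Time
[04/30/2021-08:39:24] [I] median: 3.82956 ms
[04/30/2021-08:39:24] [I] GPU Compute
[04/30/2021-08:39:24] [I] min: 599.799 ms
[04/30/2021-08:39:24] [I] median: 602.148 ms
"""

RUN_2105 = """\
[05/21/2021-09:59:29] [I] Enqueue Time
[05/21/2021-09:59:29] [I] median: 4.59708 ms
[05/21/2021-09:59:29] [I] GPU Compute
[05/21/2021-09:59:29] [I] min: 638.401 ms
[05/21/2021-09:59:29] [I] median: 643.202 ms
"""

previous = gpu_compute_median_ms(RUN_0430)
latest = gpu_compute_median_ms(RUN_2105)
print(f"{previous} ms -> {latest} ms ({(latest - previous) / previous:+.1%})")
```

The helper skips the `Enqueue Time` median because it only starts matching after the `GPU Compute` heading.
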

root@5e3e8fe1e488:/veesion# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun                                                                                                                                                                                
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[05/21/2021-09:56:36] [I] === Model Options ===                                                                                                                                                            
[05/21/2021-09:56:36] [I] Format: ONNX                                                                                                                                                                     
[05/21/2021-09:56:36] [I] Model: resnet.onnx                                                                                                                                                               
[05/21/2021-09:56:36] [I] Output:                                                                                                                                                                          
[05/21/2021-09:56:36] [I] === Build Options ===                                                                                                                                                            
[05/21/2021-09:56:36] [I] Max batch: explicit                                                                                                                                                              
[05/21/2021-09:56:36] [I] Workspace: 5000 MiB                                                                                                                                                              
[05/21/2021-09:56:36] [I] minTiming: 1                                                                                                                                                                     
[05/21/2021-09:56:36] [I] avgTiming: 8                                                                                                                                                                     
[05/21/2021-09:56:36] [I] Precision: FP32+FP16+INT8                                                                                                                                                        
[05/21/2021-09:56:36] [I] Calibration: Dynamic                                                                                                                                                             
[05/21/2021-09:56:36] [I] Refit: Disabled                                                                                                                                                                  
[05/21/2021-09:56:36] [I] Safe mode: Disabled                                                                                                                                                              
[05/21/2021-09:56:36] [I] Save engine: resnet.trt                                                                                                                                                          
[05/21/2021-09:56:36] [I] Load engine:                                                                                                                                                                     
[05/21/2021-09:56:36] [I] Builder Cache: Enabled                                                                                                                                                           
[05/21/2021-09:56:36] [I] NVTX verbosity: 0                                                                                                                                                                
[05/21/2021-09:56:36] [I] Tactic sources: Using default tactic sources                                                                                                                                     
[05/21/2021-09:56:36] [I] Input(s): fp32:chw                                                                                                                                                               
[05/21/2021-09:56:36] [I] Output(s): fp32:chw                                                                                                                                                              
[05/21/2021-09:56:36] [I] Input build shapes: model                                                                                                                                                        
[05/21/2021-09:56:36] [I] Input calibration shapes: model                                                                                                                                                  
[05/21/2021-09:56:36] [I] === System Options ===                                                                                                                                                           
[05/21/2021-09:56:36] [I] Device: 0                                                                                                                                                                        
[05/21/2021-09:56:36] [I] DLACore:                                                                                                                                                                         
[05/21/2021-09:56:36] [I] Plugins:                                                                                                                                                                         
[05/21/2021-09:56:36] [I] === Inference Options ===                                                                                                                                                        
[05/21/2021-09:56:36] [I] Batch: Explicit                                                                                                                                                                  
[05/21/2021-09:56:36] [I] Input inference shapes: model                                                                                                                                                    
[05/21/2021-09:56:36] [I] Iterations: 10                                                                                                                                                                   
[05/21/2021-09:56:36] [I] Duration: 3s (+ 200ms warm up)                                                                                                                                                   
[05/21/2021-09:56:36] [I] Sleep time: 0ms                                                                                                                                                                  
[05/21/2021-09:56:36] [I] Streams: 1                                                                                                                                                                       
[05/21/2021-09:56:36] [I] ExposeDMA: Disabled                                                                                                                                                              
[05/21/2021-09:56:36] [I] Data transfers: Disabled                                                                                                                                                         
[05/21/2021-09:56:36] [I] Spin-wait: Disabled                                                                                                                                                              
[05/21/2021-09:56:36] [I] Multithreading: Disabled                                                                                                                                                         
[05/21/2021-09:56:36] [I] CUDA Graph: Disabled                                                                                                                                                             
[05/21/2021-09:56:36] [I] Separate profiling: Enabled                                                                                                                                                      
[05/21/2021-09:56:36] [I] Skip inference: Disabled                                                                                                                                                         
[05/21/2021-09:56:36] [I] Inputs:                                                                                                                                                                          
[05/21/2021-09:56:36] [I] === Reporting Options ===                                                                                                                                                        
[05/21/2021-09:56:36] [I] Verbose: Disabled                                                                                                                                                                
[05/21/2021-09:56:36] [I] Averages: 10 inferences                                                                                                                                                          
[05/21/2021-09:56:36] [I] Percentile: 99                                                                                                                                                                   
[05/21/2021-09:56:36] [I] Dump refittable layers:Disabled                                                                                                                                                  
[05/21/2021-09:56:36] [I] Dump output: Disabled                                                                                                                                                            
[05/21/2021-09:56:36] [I] Profile: Enabled                                                                                                                                                                 
[05/21/2021-09:56:36] [I] Export timing to JSON file:                                                                                                                                                      
[05/21/2021-09:56:36] [I] Export output to JSON file:                                                                                                                                                      
[05/21/2021-09:56:36] [I] Export profile to JSON file:                                                                                                                                                     
[05/21/2021-09:56:36] [I]                                                                                                                                                                                  
[05/21/2021-09:56:36] [I] === Device Information ===                                                                                                                                                       
[05/21/2021-09:56:36] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti                                                                                                                                      
[05/21/2021-09:56:36] [I] Compute Capability: 6.1                                                                                                                                                          
[05/21/2021-09:56:36] [I] SMs: 28                                                                                                                                                                          
[05/21/2021-09:56:36] [I] Compute Clock Rate: 1.6325 GHz                                                                                                                                                   
[05/21/2021-09:56:36] [I] Device Global Memory: 11178 MiB                                                                                                                                                  
[05/21/2021-09:56:36] [I] Shared Memory per SM: 96 KiB                                                                                                                                                     
[05/21/2021-09:56:36] [I] Memory Bus Width: 352 bits (ECC disabled)                                                                                                                                        
[05/21/2021-09:56:36] [I] Memory Clock Rate: 5.505 GHz                                                                                                                                                     
[05/21/2021-09:56:36] [I]                                                                                                                                                                                  
[05/21/2021-09:56:49] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[05/21/2021-09:56:49] [I] [TRT] Input filename:   resnet.onnx                                                                                                                                              
[05/21/2021-09:56:49] [I] [TRT] ONNX IR version:  0.0.6                                                                                                                                                    
[05/21/2021-09:56:49] [I] [TRT] Opset version:    9                                                                                                                                                        
[05/21/2021-09:56:49] [I] [TRT] Producer name:    pytorch                                                                                                                                                  
[05/21/2021-09:56:49] [I] [TRT] Producer version: 1.8                                                                                                                                                      
[05/21/2021-09:56:49] [I] [TRT] Domain:                                                                                                                                                                    
[05/21/2021-09:56:49] [I] [TRT] Model version:    0                                                                                                                                                        
[05/21/2021-09:56:49] [I] [TRT] Doc string:                                                                                                                                                                
[05/21/2021-09:56:49] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[05/21/2021-09:56:49] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.                                                                  
[05/21/2021-09:56:49] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.                                                                         
[05/21/2021-09:59:09] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.                              
[05/21/2021-09:59:21] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                                                                                            
[05/21/2021-09:59:22] [I] Engine built in 165.295 sec.                                                                                                                                                     
[05/21/2021-09:59:22] [I] Starting inference                                                                                                                                                               
[05/21/2021-09:59:29] [I] Warmup completed 0 queries over 200 ms                                                                                                                                           
[05/21/2021-09:59:29] [I] Timing trace has 0 queries over 6.43171 s                                                                                                                                        
[05/21/2021-09:59:29] [I] Trace averages of 10 runs:                                                                                                                                                       
[05/21/2021-09:59:29] [I] Average on 10 runs - GPU latency: 643.169 ms - Host latency: 643.169 ms (end to end 643.169 ms, enqueue 4.60832 ms)                                                              
[05/21/2021-09:59:29] [I] Host Latency                                                                                                                                                                     
[05/21/2021-09:59:29] [I] min: 638.401 ms (end to end 638.401 ms)                                                                                                                                          
[05/21/2021-09:59:29] [I] max: 646.992 ms (end to end 646.992 ms)                                                                                                                                          
[05/21/2021-09:59:29] [I] mean: 643.169 ms (end to end 643.169 ms)                                                                                                                                         
[05/21/2021-09:59:29] [I] median: 643.202 ms (end to end 643.202 ms)                                                                                                                                       
[05/21/2021-09:59:29] [I] percentile: 646.992 ms at 99% (end to end 646.992 ms at 99%)                                                                                                                     
[05/21/2021-09:59:29] [I] throughput: 0 qps                                                                                                                                                                
[05/21/2021-09:59:29] [I] walltime: 6.43171 s                                                                                                                                                              
[05/21/2021-09:59:29] [I] Enqueue Time                                                                                                                                                                     
[05/21/2021-09:59:29] [I] min: 4.18799 ms                                                                                                                                                                  
[05/21/2021-09:59:29] [I] max: 5.26367 ms                                                                                                                                                                  
[05/21/2021-09:59:29] [I] median: 4.59708 ms                                                                                                                                                               
[05/21/2021-09:59:29] [I] GPU Compute                                                                                                                                                                      
[05/21/2021-09:59:29] [I] min: 638.401 ms                                                                                                                                                                  
[05/21/2021-09:59:29] [I] max: 646.992 ms                                                                                                                                                                  
[05/21/2021-09:59:29] [I] mean: 643.169 ms                                                                                                                                                                 
[05/21/2021-09:59:29] [I] median: 643.202 ms                                                                                                                                                               
[05/21/2021-09:59:29] [I] percentile: 646.992 ms at 99%                                                                                                                                                    
[05/21/2021-09:59:29] [I] total compute time: 6.43169 s                                                                                                                                                    
[05/21/2021-09:59:36] [I]                                                                                                                                                                                  
[05/21/2021-09:59:36] [I] === Profile (11 iterations ) ===                                                                                                                                                 
[05/21/2021-09:59:36] [I]                                                           Layer   Time (ms)   Avg. Time (ms)   Time %                                                                            
[05/21/2021-09:59:36] [I]                             Conv_0 + Relu_1 input reformatter 0        6.78           0.6160      0.1                                                                            
[05/21/2021-09:59:36] [I]                                                 Conv_0 + Relu_1      172.72          15.7019      2.4                                                                            
[05/21/2021-09:59:36] [I]                                                 Conv_2 + Relu_3      146.27          13.2970      2.1                                                                            
[05/21/2021-09:59:36] [I]                                                 Conv_4 + Relu_5      624.73          56.7937      8.8
[05/21/2021-09:59:36] [I]                                                 Conv_6 + Relu_7      230.88          20.9890      3.2
[05/21/2021-09:59:36] [I]                                                 Conv_8 + Relu_9      646.51          58.7733      9.1
[05/21/2021-09:59:36] [I]                                      Conv_10 + Add_11 + Relu_12      254.60          23.1453      3.6
[05/21/2021-09:59:36] [I]                                               Conv_13 + Relu_14      646.50          58.7731      9.1
[05/21/2021-09:59:36] [I]                                               Conv_15 + Relu_16      231.37          21.0339      3.3
[05/21/2021-09:59:36] [I]                                               Conv_17 + Relu_18      644.87          58.6249      9.1
[05/21/2021-09:59:36] [I]                                      Conv_19 + Add_20 + Relu_21      254.10          23.0998      3.6
[05/21/2021-09:59:36] [I]                 Conv_19 + Add_20 + Relu_21 output reformatter 0       30.56           2.7778      0.4
[05/21/2021-09:59:36] [I]                                               Conv_22 + Relu_23      347.64          31.6034      4.9
[05/21/2021-09:59:36] [I]                                               Conv_24 + Relu_25       79.01           7.1826      1.1
[05/21/2021-09:59:36] [I]                                               Conv_26 + Relu_27      315.81          28.7103      4.4
[05/21/2021-09:59:36] [I]                                                         Conv_28       71.34           6.4858      1.0
[05/21/2021-09:59:36] [I]                                      Conv_29 + Add_30 + Relu_31       40.39           3.6723      0.6
[05/21/2021-09:59:36] [I]                                               Conv_32 + Relu_33      461.13          41.9213      6.5
[05/21/2021-09:59:36] [I]                                               Conv_34 + Relu_35       90.01           8.1824      1.3
[05/21/2021-09:59:36] [I]                                               Conv_36 + Relu_37      457.58          41.5979      6.4
[05/21/2021-09:59:36] [I]                                      Conv_38 + Add_39 + Relu_40      100.24           9.1126      1.4
[05/21/2021-09:59:36] [I]                 Conv_38 + Add_39 + Relu_40 output reformatter 0        8.65           0.7863      0.1
[05/21/2021-09:59:36] [I]                                               Conv_41 + Relu_42      160.36          14.5785      2.3
[05/21/2021-09:59:36] [I]                                               Conv_43 + Relu_44       28.89           2.6267      0.4
[05/21/2021-09:59:36] [I]                                               Conv_45 + Relu_46      129.67          11.7878      1.8
[05/21/2021-09:59:36] [I]                                                         Conv_47       38.74           3.5214      0.5
[05/21/2021-09:59:36] [I]                                      Conv_48 + Add_49 + Relu_50        9.78           0.8890      0.1
[05/21/2021-09:59:36] [I]                                               Conv_51 + Relu_52      188.34          17.1221      2.6
[05/21/2021-09:59:36] [I]                                               Conv_53 + Relu_54       47.58           4.3254      0.7
[05/21/2021-09:59:36] [I]                                               Conv_55 + Relu_56      187.94          17.0853      2.6
[05/21/2021-09:59:36] [I]                                      Conv_57 + Add_58 + Relu_59       49.01           4.4556      0.7
[05/21/2021-09:59:36] [I]                           Conv_60 + Relu_61 input reformatter 0        1.96           0.1779      0.0
[05/21/2021-09:59:36] [I]                                               Conv_60 + Relu_61       45.18           4.1073      0.6
[05/21/2021-09:59:36] [I]                           Conv_62 + Relu_63 input reformatter 0        1.93           0.1752      0.0
[05/21/2021-09:59:36] [I]                                               Conv_62 + Relu_63       18.87           1.7156      0.3
[05/21/2021-09:59:36] [I]                           Conv_64 + Relu_65 input reformatter 0        0.59           0.0536      0.0
[05/21/2021-09:59:36] [I]                                               Conv_64 + Relu_65       79.72           7.2476      1.1
[05/21/2021-09:59:36] [I]                                     Conv_66 input reformatter 0        1.10           0.1000      0.0
[05/21/2021-09:59:36] [I]                                                         Conv_66       18.49           1.6807      0.3
[05/21/2021-09:59:36] [I]                                      Conv_67 + Add_68 + Relu_69        3.82           0.3477      0.1
[05/21/2021-09:59:36] [I]                           Conv_70 + Relu_71 input reformatter 0        0.59           0.0538      0.0
[05/21/2021-09:59:36] [I]                                               Conv_70 + Relu_71       93.10           8.4637      1.3
[05/21/2021-09:59:36] [I]                           Conv_72 + Relu_73 input reformatter 0        1.36           0.1233      0.0
[05/21/2021-09:59:36] [I]                                               Conv_72 + Relu_73       22.65           2.0595      0.3
[05/21/2021-09:59:36] [I]                           Conv_74 + Relu_75 input reformatter 0        0.59           0.0534      0.0
[05/21/2021-09:59:36] [I]                                               Conv_74 + Relu_75       92.77           8.4334      1.3
[05/21/2021-09:59:36] [I]                  Conv_76 + Add_77 + Relu_78 input reformatter 0        1.36           0.1234      0.0
[05/21/2021-09:59:36] [I]                                      Conv_76 + Add_77 + Relu_78       23.08           2.0984      0.3
[05/21/2021-09:59:36] [I]                                            GlobalAveragePool_79        1.12           0.1021      0.0
[05/21/2021-09:59:36] [I]  Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0        0.08           0.0073      0.0
[05/21/2021-09:59:36] [I]                                                         Gemm_81        0.16           0.0148      0.0
[05/21/2021-09:59:36] [I]                                                           Total     7110.52         646.4112    100.0
[05/21/2021-09:59:36] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun

So that is about a 13% regression.
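For readers checking the percentages quoted in this thread, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and the example input is the pair of throughput figures from the first post (7.52 and 6.80 videos/s on the 1080 Ti). The second helper assumes a fixed batch size, so that throughput is inversely proportional to latency.

```python
def relative_change_pct(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def throughput_change_from_latency_pct(latency_before_ms, latency_after_ms):
    """Throughput change implied by a latency change (assumes a fixed batch size)."""
    return relative_change_pct(1.0 / latency_before_ms, 1.0 / latency_after_ms)


# Throughput in videos/s reported in the first post: 20.11 vs 20.12 on a 1080 Ti.
print(f"{relative_change_pct(7.52, 6.80):.1f}%")  # prints -9.6%
```

Note that a latency increase of x% does not correspond to a throughput drop of x%, which is one reason figures computed from latencies and from videos/s can differ slightly.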

ttyio commented 3 years ago

Hello @hyperfraise, could you try again on 21.05 without --best? Thanks!

dmenig commented 3 years ago

Hi, sure! Here are my results on 21.05.

Without --best, here is my output:

root@6b4242bad94e:/workspace# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/02/2021-14:59:02] [I] === Model Options ===                                                                                                                                                            
[06/02/2021-14:59:02] [I] Format: ONNX                                                                                                                                                                     
[06/02/2021-14:59:02] [I] Model: resnet.onnx                                                                                                                                                               
[06/02/2021-14:59:02] [I] Output:                                                                                                                                                                          
[06/02/2021-14:59:02] [I] === Build Options ===                                                                                                                                                            
[06/02/2021-14:59:02] [I] Max batch: explicit                                                                                                                                                              
[06/02/2021-14:59:02] [I] Workspace: 5000 MiB                                                                                                                                                              
[06/02/2021-14:59:02] [I] minTiming: 1                                                                                                                                                                     
[06/02/2021-14:59:02] [I] avgTiming: 8                                                                                                                                                                     
[06/02/2021-14:59:02] [I] Precision: FP32                                                                                                                                                                  
[06/02/2021-14:59:02] [I] Calibration:                                                                                                                                                                     
[06/02/2021-14:59:02] [I] Refit: Disabled                                                                                                                                                                  
[06/02/2021-14:59:02] [I] Safe mode: Disabled                                                                                                                                                              
[06/02/2021-14:59:02] [I] Save engine: resnet.trt                                                                                                                                                          
[06/02/2021-14:59:02] [I] Load engine:                                                                                                                                                                     
[06/02/2021-14:59:02] [I] Builder Cache: Enabled                                                                                                                                                           
[06/02/2021-14:59:02] [I] NVTX verbosity: 0                                                                                                                                                                
[06/02/2021-14:59:02] [I] Tactic sources: Using default tactic sources                                                                                                                                     
[06/02/2021-14:59:02] [I] Input(s): fp32:chw                                                                                                                                                               
[06/02/2021-14:59:02] [I] Output(s): fp32:chw                                                                                                                                                              
[06/02/2021-14:59:02] [I] Input build shapes: model
[06/02/2021-14:59:02] [I] Input calibration shapes: model
[06/02/2021-14:59:02] [I] === System Options ===
[06/02/2021-14:59:02] [I] Device: 0
[06/02/2021-14:59:02] [I] DLACore:
[06/02/2021-14:59:02] [I] Plugins:
[06/02/2021-14:59:02] [I] === Inference Options ===
[06/02/2021-14:59:02] [I] Batch: Explicit
[06/02/2021-14:59:02] [I] Input inference shapes: model
[06/02/2021-14:59:02] [I] Iterations: 10
[06/02/2021-14:59:02] [I] Duration: 3s (+ 200ms warm up)
[06/02/2021-14:59:02] [I] Sleep time: 0ms
[06/02/2021-14:59:02] [I] Streams: 1
[06/02/2021-14:59:02] [I] ExposeDMA: Disabled
[06/02/2021-14:59:02] [I] Data transfers: Disabled
[06/02/2021-14:59:02] [I] Spin-wait: Disabled
[06/02/2021-14:59:02] [I] Multithreading: Disabled
[06/02/2021-14:59:02] [I] CUDA Graph: Disabled
[06/02/2021-14:59:02] [I] Separate profiling: Enabled
[06/02/2021-14:59:02] [I] Skip inference: Disabled
[06/02/2021-14:59:02] [I] Inputs:
[06/02/2021-14:59:02] [I] === Reporting Options ===
[06/02/2021-14:59:02] [I] Verbose: Disabled
[06/02/2021-14:59:02] [I] Averages: 10 inferences
[06/02/2021-14:59:02] [I] Percentile: 99
[06/02/2021-14:59:02] [I] Dump refittable layers:Disabled
[06/02/2021-14:59:02] [I] Dump output: Disabled
[06/02/2021-14:59:02] [I] Profile: Enabled
[06/02/2021-14:59:02] [I] Export timing to JSON file:
[06/02/2021-14:59:02] [I] Export output to JSON file:
[06/02/2021-14:59:02] [I] Export profile to JSON file:
[06/02/2021-14:59:02] [I]
[06/02/2021-14:59:02] [I] === Device Information ===
[06/02/2021-14:59:02] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti
[06/02/2021-14:59:02] [I] Compute Capability: 6.1
[06/02/2021-14:59:02] [I] SMs: 28
[06/02/2021-14:59:02] [I] Compute Clock Rate: 1.62 GHz
[06/02/2021-14:59:02] [I] Device Global Memory: 11177 MiB
[06/02/2021-14:59:02] [I] Shared Memory per SM: 96 KiB
[06/02/2021-14:59:02] [I] Memory Bus Width: 352 bits (ECC disabled)
[06/02/2021-14:59:02] [I] Memory Clock Rate: 5.505 GHz
[06/02/2021-14:59:02] [I]
[06/02/2021-14:59:12] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-14:59:12] [I] [TRT] Input filename:   resnet.onnx
[06/02/2021-14:59:12] [I] [TRT] ONNX IR version:  0.0.6
[06/02/2021-14:59:12] [I] [TRT] Opset version:    9
[06/02/2021-14:59:12] [I] [TRT] Producer name:    pytorch
[06/02/2021-14:59:12] [I] [TRT] Producer version: 1.8
[06/02/2021-14:59:12] [I] [TRT] Domain:
[06/02/2021-14:59:12] [I] [TRT] Model version:    0
[06/02/2021-14:59:12] [I] [TRT] Doc string:
[06/02/2021-14:59:12] [I] [TRT] ----------------------------------------------------------------
[06/02/2021-15:00:22] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/02/2021-15:00:26] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[06/02/2021-15:00:27] [I] Engine built in 85.6795 sec.
[06/02/2021-15:00:27] [I] Starting inference
[06/02/2021-15:00:36] [I] Warmup completed 0 queries over 200 ms
[06/02/2021-15:00:36] [I] Timing trace has 0 queries over 7.97772 s
[06/02/2021-15:00:36] [I] Trace averages of 10 runs:
[06/02/2021-15:00:36] [I] Average on 10 runs - GPU latency: 797.77 ms - Host latency: 797.77 ms (end to end 797.77 ms, enqueue 3.45671 ms)
[06/02/2021-15:00:36] [I] Host Latency
[06/02/2021-15:00:36] [I] min: 768.763 ms (end to end 768.763 ms)
[06/02/2021-15:00:36] [I] max: 826.762 ms (end to end 826.762 ms)
[06/02/2021-15:00:36] [I] mean: 797.77 ms (end to end 797.77 ms)
[06/02/2021-15:00:36] [I] median: 788.8 ms (end to end 788.8 ms)
[06/02/2021-15:00:36] [I] percentile: 826.762 ms at 99% (end to end 826.762 ms at 99%)
[06/02/2021-15:00:36] [I] throughput: 0 qps
[06/02/2021-15:00:36] [I] walltime: 7.97772 s
[06/02/2021-15:00:36] [I] Enqueue Time
[06/02/2021-15:00:36] [I] min: 3.01349 ms
[06/02/2021-15:00:36] [I] max: 3.8457 ms
[06/02/2021-15:00:36] [I] median: 3.46265 ms
[06/02/2021-15:00:36] [I] GPU Compute
[06/02/2021-15:00:36] [I] min: 768.763 ms
[06/02/2021-15:00:36] [I] max: 826.762 ms
[06/02/2021-15:00:36] [I] mean: 797.77 ms
[06/02/2021-15:00:36] [I] median: 788.8 ms
[06/02/2021-15:00:36] [I] percentile: 826.762 ms at 99%
[06/02/2021-15:00:36] [I] total compute time: 7.9777 s
[06/02/2021-15:00:45] [I] 
[06/02/2021-15:00:45] [I] === Profile (11 iterations ) ===
[06/02/2021-15:00:45] [I]                       Layer   Time (ms)   Avg. Time (ms)   Time %
[06/02/2021-15:00:45] [I]             Conv_0 + Relu_1      193.32          17.5748      2.1
[06/02/2021-15:00:45] [I]             Conv_2 + Relu_3      223.45          20.3133      2.5
[06/02/2021-15:00:45] [I]             Conv_4 + Relu_5      705.83          64.1662      7.8
[06/02/2021-15:00:45] [I]             Conv_6 + Relu_7      522.64          47.5131      5.8
[06/02/2021-15:00:45] [I]             Conv_8 + Relu_9      706.56          64.2329      7.8
[06/02/2021-15:00:45] [I]  Conv_10 + Add_11 + Relu_12      565.31          51.3922      6.3
[06/02/2021-15:00:45] [I]           Conv_13 + Relu_14      706.37          64.2157      7.8
[06/02/2021-15:00:45] [I]           Conv_15 + Relu_16      523.00          47.5453      5.8
[06/02/2021-15:00:45] [I]           Conv_17 + Relu_18      707.66          64.3325      7.9
[06/02/2021-15:00:45] [I]  Conv_19 + Add_20 + Relu_21      566.94          51.5397      6.3
[06/02/2021-15:00:45] [I]           Conv_22 + Relu_23      363.44          33.0400      4.0
[06/02/2021-15:00:45] [I]           Conv_24 + Relu_25       81.25           7.3860      0.9
[06/02/2021-15:00:45] [I]           Conv_26 + Relu_27      333.62          30.3294      3.7
[06/02/2021-15:00:45] [I]                     Conv_28       74.46           6.7694      0.8
[06/02/2021-15:00:45] [I]  Conv_29 + Add_30 + Relu_31       41.54           3.7762      0.5
[06/02/2021-15:00:45] [I]           Conv_32 + Relu_33      489.90          44.5366      5.4
[06/02/2021-15:00:45] [I]           Conv_34 + Relu_35       93.35           8.4865      1.0
[06/02/2021-15:00:45] [I]           Conv_36 + Relu_37      487.77          44.3423      5.4
[06/02/2021-15:00:45] [I]  Conv_38 + Add_39 + Relu_40      103.68           9.4258      1.2
[06/02/2021-15:00:45] [I]           Conv_41 + Relu_42      264.03          24.0026      2.9
[06/02/2021-15:00:45] [I]           Conv_43 + Relu_44       32.95           2.9950      0.4
[06/02/2021-15:00:45] [I]           Conv_45 + Relu_46      135.13          12.2848      1.5
[06/02/2021-15:00:45] [I]                     Conv_47       50.09           4.5535      0.6
[06/02/2021-15:00:45] [I]  Conv_48 + Add_49 + Relu_50       17.67           1.6059      0.2
[06/02/2021-15:00:45] [I]           Conv_51 + Relu_52      197.70          17.9731      2.2
[06/02/2021-15:00:45] [I]           Conv_53 + Relu_54       62.66           5.6960      0.7
[06/02/2021-15:00:45] [I]           Conv_55 + Relu_56      197.99          17.9987      2.2
[06/02/2021-15:00:45] [I]  Conv_57 + Add_58 + Relu_59       65.34           5.9404      0.7
[06/02/2021-15:00:45] [I]           Conv_60 + Relu_61       48.82           4.4385      0.5
[06/02/2021-15:00:45] [I]           Conv_62 + Relu_63       33.37           3.0339      0.4
[06/02/2021-15:00:45] [I]           Conv_64 + Relu_65       86.51           7.8649      1.0
[06/02/2021-15:00:45] [I]                     Conv_66       32.86           2.9876      0.4
[06/02/2021-15:00:45] [I]  Conv_67 + Add_68 + Relu_69        5.36           0.4872      0.1
[06/02/2021-15:00:45] [I]           Conv_70 + Relu_71      100.95           9.1773      1.1
[06/02/2021-15:00:45] [I]           Conv_72 + Relu_73       39.48           3.5889      0.4
[06/02/2021-15:00:45] [I]           Conv_74 + Relu_75      101.06           9.1873      1.1
[06/02/2021-15:00:45] [I]  Conv_76 + Add_77 + Relu_78       40.85           3.7139      0.5
[06/02/2021-15:00:45] [I]        GlobalAveragePool_79        1.60           0.1451      0.0
[06/02/2021-15:00:45] [I]                     Gemm_81        0.17           0.0153      0.0
[06/02/2021-15:00:45] [I]                       Total     9004.69         818.6078    100.0
[06/02/2021-15:00:45] [I] 
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
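
Comparing per-layer tables like the one above by eye is error-prone, so here is a minimal sketch of a parser for them. It assumes the row layout seen in these logs (timestamp, `[I]`, layer name, total time, average time, percentage); the regular expression and the function names are hypothetical and not part of trtexec. The sample rows are copied from the profile above.

```python
import re

# One profile row: "[timestamp] [I] <layer name> <total ms> <avg ms> <time %>".
ROW = re.compile(
    r"^\[[^\]]+\]\s+\[I\]\s+(?P<layer>.+?)"
    r"\s+(?P<total>\d+\.\d+)\s+(?P<avg>\d+\.\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)


def parse_profile(log_text):
    """Return {layer name: average time in ms} for each profile row found."""
    times = {}
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match and match.group("layer") != "Total":
            times[match.group("layer")] = float(match.group("avg"))
    return times


def slowdown_ratios(baseline, candidate):
    """Per-layer candidate/baseline ratio, for layers present in both runs."""
    return {
        layer: candidate[layer] / baseline[layer]
        for layer in baseline
        if layer in candidate and baseline[layer] > 0
    }


SAMPLE = """\
[06/02/2021-15:00:45] [I]             Conv_4 + Relu_5      705.83          64.1662      7.8
[06/02/2021-15:00:45] [I]             Conv_6 + Relu_7      522.64          47.5131      5.8
[06/02/2021-15:00:45] [I]                       Total     9004.69         818.6078    100.0
"""

if __name__ == "__main__":
    for layer, avg_ms in sorted(parse_profile(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{layer:30s} {avg_ms:9.4f} ms")
```

Ratios from `slowdown_ratios` are only meaningful when both logs come from the same machine with the same precision flags. Layer names can also differ between builds, since reformatter layers appear in some engines and not in others, so only layers present in both runs are compared.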

This was run on another machine, so here are the results with --best on this machine as well:

root@6b4242bad94e:/workspace# /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/02/2021-14:51:33] [I] === Model Options ===                                                                                
[06/02/2021-14:51:33] [I] Format: ONNX                                                                                                                                                                     
[06/02/2021-14:51:33] [I] Model: resnet.onnx                                                                                   
[06/02/2021-14:51:33] [I] Output:                                                                                              
[06/02/2021-14:51:33] [I] === Build Options ===                                                                                
[06/02/2021-14:51:33] [I] Max batch: explicit                                                                                  
[06/02/2021-14:51:33] [I] Workspace: 5000 MiB                                                                                  
[06/02/2021-14:51:33] [I] minTiming: 1                                                                                         
[06/02/2021-14:51:33] [I] avgTiming: 8                                                                                         
[06/02/2021-14:51:33] [I] Precision: FP32+FP16+INT8                                                                            
[06/02/2021-14:51:33] [I] Calibration: Dynamic                                                                                 
[06/02/2021-14:51:33] [I] Refit: Disabled                                                                                      
[06/02/2021-14:51:33] [I] Safe mode: Disabled                                                                                  
[06/02/2021-14:51:33] [I] Save engine: resnet.trt                                                                              
[06/02/2021-14:51:33] [I] Load engine:                                                                                         
[06/02/2021-14:51:33] [I] Builder Cache: Enabled                                                                               
[06/02/2021-14:51:33] [I] NVTX verbosity: 0                                                                                    
[06/02/2021-14:51:33] [I] Tactic sources: Using default tactic sources                                                         
[06/02/2021-14:51:33] [I] Input(s): fp32:chw                                                                                   
[06/02/2021-14:51:33] [I] Output(s): fp32:chw                                                                                  
[06/02/2021-14:51:33] [I] Input build shapes: model                                                                                      
[06/02/2021-14:51:33] [I] Input calibration shapes: model                                                                         
[06/02/2021-14:51:33] [I] === System Options ===                                                                                                                             
[06/02/2021-14:51:33] [I] Device: 0                                                                                            
[06/02/2021-14:51:33] [I] DLACore:                                                                                             
[06/02/2021-14:51:33] [I] Plugins:                                                                                             
[06/02/2021-14:51:33] [I] === Inference Options ===                                                                            
[06/02/2021-14:51:33] [I] Batch: Explicit                                                                                      
[06/02/2021-14:51:33] [I] Input inference shapes: model                                                                        
[06/02/2021-14:51:33] [I] Iterations: 10                                                                                                     
[06/02/2021-14:51:33] [I] Duration: 3s (+ 200ms warm up)                                                                       
[06/02/2021-14:51:33] [I] Sleep time: 0ms                                                                                      
[06/02/2021-14:51:33] [I] Streams: 1                                                                                           
[06/02/2021-14:51:33] [I] ExposeDMA: Disabled                                                                                  
[06/02/2021-14:51:33] [I] Data transfers: Disabled                                                                             
[06/02/2021-14:51:33] [I] Spin-wait: Disabled                                                                                  
[06/02/2021-14:51:33] [I] Multithreading: Disabled                                                                             
[06/02/2021-14:51:33] [I] CUDA Graph: Disabled                                                                                 
[06/02/2021-14:51:33] [I] Separate profiling: Enabled                                                                          
[06/02/2021-14:51:33] [I] Skip inference: Disabled                                                                             
[06/02/2021-14:51:33] [I] Inputs:                                                                                              
[06/02/2021-14:51:33] [I] === Reporting Options ===                                                                            
[06/02/2021-14:51:33] [I] Verbose: Disabled                                                                                    
[06/02/2021-14:51:33] [I] Averages: 10 inferences                                                                              
[06/02/2021-14:51:33] [I] Percentile: 99                                                                                       
[06/02/2021-14:51:33] [I] Dump refittable layers:Disabled                                                                      
[06/02/2021-14:51:33] [I] Dump output: Disabled                                                                                
[06/02/2021-14:51:33] [I] Profile: Enabled                                                                                     
[06/02/2021-14:51:33] [I] Export timing to JSON file:                                                                          
[06/02/2021-14:51:33] [I] Export output to JSON file:                                                                          
[06/02/2021-14:51:33] [I] Export profile to JSON file:                                                                         
[06/02/2021-14:51:33] [I]                                                                                                      
[06/02/2021-14:51:33] [I] === Device Information ===                                                                           
[06/02/2021-14:51:33] [I] Selected Device: NVIDIA GeForce GTX 1080 Ti
[06/02/2021-14:51:33] [I] Compute Capability: 6.1
[06/02/2021-14:51:33] [I] SMs: 28                                                                                              
[06/02/2021-14:51:33] [I] Compute Clock Rate: 1.62 GHz                                                                         
[06/02/2021-14:51:33] [I] Device Global Memory: 11177 MiB                                                                      
[06/02/2021-14:51:33] [I] Shared Memory per SM: 96 KiB                                                                         
[06/02/2021-14:51:33] [I] Memory Bus Width: 352 bits (ECC disabled)                                                            
[06/02/2021-14:51:33] [I] Memory Clock Rate: 5.505 GHz                                                                         
[06/02/2021-14:51:33] [I]                                                                                                      
[06/02/2021-14:51:43] [I] [TRT] ----------------------------------------------------------------                               
[06/02/2021-14:51:43] [I] [TRT] Input filename:   resnet.onnx                                                                  
[06/02/2021-14:51:43] [I] [TRT] ONNX IR version:  0.0.6                                                                        
[06/02/2021-14:51:43] [I] [TRT] Opset version:    9                                                                            
[06/02/2021-14:51:43] [I] [TRT] Producer name:    pytorch                                                                      
[06/02/2021-14:51:43] [I] [TRT] Producer version: 1.8                                                                          
[06/02/2021-14:51:43] [I] [TRT] Domain:                                                                                        
[06/02/2021-14:51:43] [I] [TRT] Model version:    0                                                                            
[06/02/2021-14:51:43] [I] [TRT] Doc string:                                                                                    
[06/02/2021-14:51:43] [I] [TRT] ----------------------------------------------------------------                               
[06/02/2021-14:51:43] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[06/02/2021-14:51:43] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[06/02/2021-14:54:05] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[06/02/2021-14:54:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                
[06/02/2021-14:54:17] [I] Engine built in 164.476 sec.                                                                         
[06/02/2021-14:54:17] [I] Starting inference                                                                                   
[06/02/2021-14:54:25] [I] Warmup completed 0 queries over 200 ms                                                               
[06/02/2021-14:54:25] [I] Timing trace has 0 queries over 6.48338 s                                                            
[06/02/2021-14:54:25] [I] Trace averages of 10 runs:                                                                           
[06/02/2021-14:54:25] [I] Average on 10 runs - GPU latency: 648.337 ms - Host latency: 648.337 ms (end to end 648.337 ms, enqueue 5.02236 ms)
[06/02/2021-14:54:25] [I] Host Latency                                                                                         
[06/02/2021-14:54:25] [I] min: 635.064 ms (end to end 635.064 ms)                                                              
[06/02/2021-14:54:25] [I] max: 661.95 ms (end to end 661.95 ms)                                                                
[06/02/2021-14:54:25] [I] mean: 648.337 ms (end to end 648.337 ms)                                                             
[06/02/2021-14:54:25] [I] median: 651.916 ms (end to end 651.916 ms)                                                           
[06/02/2021-14:54:25] [I] percentile: 661.95 ms at 99% (end to end 661.95 ms at 99%)                                           
[06/02/2021-14:54:25] [I] throughput: 0 qps                                                                                    
[06/02/2021-14:54:25] [I] walltime: 6.48338 s                                                                                  
[06/02/2021-14:54:25] [I] Enqueue Time                                                                                         
[06/02/2021-14:54:25] [I] min: 3.5438 ms                                                                                       
[06/02/2021-14:54:25] [I] max: 5.24646 ms                                                                                      
[06/02/2021-14:54:25] [I] median: 5.19568 ms                                                                                   
[06/02/2021-14:54:25] [I] GPU Compute                                                                                          
[06/02/2021-14:54:25] [I] min: 635.064 ms                                                                                      
[06/02/2021-14:54:25] [I] max: 661.95 ms                                                                                       
[06/02/2021-14:54:25] [I] mean: 648.337 ms                                                                                     
[06/02/2021-14:54:25] [I] median: 651.916 ms                                                                                   
[06/02/2021-14:54:25] [I] percentile: 661.95 ms at 99%                                                                         
[06/02/2021-14:54:25] [I] total compute time: 6.48337 s                                                                        
[06/02/2021-14:54:32] [I]                                                                                                      
[06/02/2021-14:54:32] [I] === Profile (11 iterations ) ===                                                                     
[06/02/2021-14:54:32] [I]                                                           Layer   Time (ms)   Avg. Time (ms)   Time %
[06/02/2021-14:54:32] [I]                             Conv_0 + Relu_1 input reformatter 0        6.78           0.6165      0.1
[06/02/2021-14:54:32] [I]                                                 Conv_0 + Relu_1      181.18          16.4711       2.5                                                                            
[06/02/2021-14:54:32] [I]                                                 Conv_2 + Relu_3      149.82          13.6198      2.1
[06/02/2021-14:54:32] [I]                                                 Conv_4 + Relu_5      649.60          59.0542      8.9
[06/02/2021-14:54:32] [I]                                                 Conv_6 + Relu_7      232.50          21.1359      3.2
[06/02/2021-14:54:32] [I]                                                 Conv_8 + Relu_9      651.02          59.1834      9.0
[06/02/2021-14:54:32] [I]                                      Conv_10 + Add_11 + Relu_12      256.36          23.3053      3.5
[06/02/2021-14:54:32] [I]                                               Conv_13 + Relu_14      653.66          59.4238      9.0
[06/02/2021-14:54:32] [I]                                               Conv_15 + Relu_16      233.98          21.2712      3.2
[06/02/2021-14:54:32] [I]                                               Conv_17 + Relu_18      655.17          59.5611      9.0
[06/02/2021-14:54:32] [I]                                      Conv_19 + Add_20 + Relu_21      257.12          23.3748      3.5
[06/02/2021-14:54:32] [I]                 Conv_19 + Add_20 + Relu_21 output reformatter 0       31.11           2.8283      0.4
[06/02/2021-14:54:32] [I]                                               Conv_22 + Relu_23      353.59          32.1443      4.9
[06/02/2021-14:54:32] [I]                                               Conv_24 + Relu_25       79.96           7.2688      1.1
[06/02/2021-14:54:32] [I]                                               Conv_26 + Relu_27      323.08          29.3709      4.4
[06/02/2021-14:54:32] [I]                                                         Conv_28       72.87           6.6244      1.0
[06/02/2021-14:54:32] [I]                                      Conv_29 + Add_30 + Relu_31       40.84           3.7125      0.6
[06/02/2021-14:54:32] [I]                                               Conv_32 + Relu_33      472.22          42.9289      6.5
[06/02/2021-14:54:32] [I]                                               Conv_34 + Relu_35       91.85           8.3500      1.3
[06/02/2021-14:54:32] [I]                                               Conv_36 + Relu_37      471.57          42.8702      6.5
[06/02/2021-14:54:32] [I]                                      Conv_38 + Add_39 + Relu_40      101.94           9.2671      1.4
[06/02/2021-14:54:32] [I]                 Conv_38 + Add_39 + Relu_40 output reformatter 0        8.77           0.7973      0.1
[06/02/2021-14:54:32] [I]                                               Conv_41 + Relu_42      165.75          15.0680      2.3
[06/02/2021-14:54:32] [I]                                               Conv_43 + Relu_44       30.03           2.7297      0.4
[06/02/2021-14:54:32] [I]                                               Conv_45 + Relu_46      134.64          12.2402      1.9
[06/02/2021-14:54:32] [I]                                                         Conv_47       39.64           3.6035      0.5
[06/02/2021-14:54:32] [I]                                      Conv_48 + Add_49 + Relu_50        9.98           0.9072      0.1
[06/02/2021-14:54:32] [I]                                               Conv_51 + Relu_52      196.39          17.8537      2.7
[06/02/2021-14:54:32] [I]                                               Conv_53 + Relu_54       48.94           4.4493      0.7
[06/02/2021-14:54:32] [I]                                               Conv_55 + Relu_56      195.75          17.7952      2.7
[06/02/2021-14:54:32] [I]                                      Conv_57 + Add_58 + Relu_59       50.34           4.5768      0.7
[06/02/2021-14:54:32] [I]                           Conv_60 + Relu_61 input reformatter 0        2.03           0.1850      0.0
[06/02/2021-14:54:32] [I]                                               Conv_60 + Relu_61       46.81           4.2552      0.6
[06/02/2021-14:54:32] [I]                           Conv_62 + Relu_63 input reformatter 0        1.95           0.1775      0.0
[06/02/2021-14:54:32] [I]                                               Conv_62 + Relu_63       19.64           1.7850      0.3
[06/02/2021-14:54:32] [I]                           Conv_64 + Relu_65 input reformatter 0        0.61           0.0557      0.0
[06/02/2021-14:54:32] [I]                                               Conv_64 + Relu_65       83.39           7.5812      1.1
[06/02/2021-14:54:32] [I]                                     Conv_66 input reformatter 0        1.12           0.1021      0.0
[06/02/2021-14:54:32] [I]                                                         Conv_66       19.26           1.7512      0.3
[06/02/2021-14:54:32] [I]                                      Conv_67 + Add_68 + Relu_69        3.94           0.3586      0.1
[06/02/2021-14:54:32] [I]                           Conv_70 + Relu_71 input reformatter 0        0.61           0.0552      0.0
[06/02/2021-14:54:32] [I]                                               Conv_70 + Relu_71       97.39           8.8534      1.3
[06/02/2021-14:54:32] [I]                           Conv_72 + Relu_73 input reformatter 0        1.38           0.1253      0.0
[06/02/2021-14:54:32] [I]                                               Conv_72 + Relu_73       23.65           2.1500      0.3
[06/02/2021-14:54:32] [I]                           Conv_74 + Relu_75 input reformatter 0        0.61           0.0556      0.0
[06/02/2021-14:54:32] [I]                                               Conv_74 + Relu_75       97.34           8.8488      1.3
[06/02/2021-14:54:32] [I]                  Conv_76 + Add_77 + Relu_78 input reformatter 0        1.38           0.1250      0.0
[06/02/2021-14:54:32] [I]                                      Conv_76 + Add_77 + Relu_78       24.07           2.1878      0.3
[06/02/2021-14:54:32] [I]                                            GlobalAveragePool_79        1.15           0.1047      0.0
[06/02/2021-14:54:32] [I]  Flatten_80 + (Unnamed Layer* 81) [Shuffle] input reformatter 0        0.06           0.0056      0.0
[06/02/2021-14:54:32] [I]                                                         Gemm_81        0.17           0.0152      0.0
[06/02/2021-14:54:32] [I]                                                           Total     7273.00         661.1818    100.0
[06/02/2021-14:54:32] [I] 
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
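
Because trtexec prints `throughput: 0 qps` for this run, the log has to be post-processed by hand to get comparable numbers. The sketch below is one way to do that, not part of trtexec: it pulls the per-layer rows out of a `--dumpProfile` log, ranks them by average time, and converts the mean GPU latency into videos/s. The function names (`parse_profile`, `slowest_layers`, `videos_per_second`) are hypothetical, the row layout is assumed to match the log above, and the batch size of 4 is taken from the `dummy_input` shape in the export script.

```python
import re

# Assumed row layout (copied from the log above):
# [timestamp] [I]   <layer name>   <total ms>   <avg ms>   <time %>
PROFILE_ROW = re.compile(
    r"^\[[^\]]+\] \[I\]\s+(?P<layer>.+?)\s+"
    r"(?P<total_ms>\d+\.\d+)\s+(?P<avg_ms>\d+\.\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)


def parse_profile(log_text):
    """Return (layer, avg_ms) pairs for every profile row except the Total row."""
    rows = []
    for line in log_text.splitlines():
        match = PROFILE_ROW.match(line)
        if match and match.group("layer") != "Total":
            rows.append((match.group("layer"), float(match.group("avg_ms"))))
    return rows


def slowest_layers(rows, n=3):
    """Return the n rows with the largest average time, slowest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def videos_per_second(batch_size, mean_latency_ms):
    """Convert a per-batch latency in milliseconds into videos per second."""
    return batch_size / (mean_latency_ms / 1000.0)


# A few rows copied from the log above, plus one non-profile line.
SAMPLE = """
[06/02/2021-14:54:25] [I] mean: 648.337 ms
[06/02/2021-14:54:32] [I]      Conv_0 + Relu_1 input reformatter 0        6.78           0.6165      0.1
[06/02/2021-14:54:32] [I]      Conv_4 + Relu_5      649.60          59.0542      8.9
[06/02/2021-14:54:32] [I]      Conv_17 + Relu_18      655.17          59.5611      9.0
[06/02/2021-14:54:32] [I]      Gemm_81        0.17           0.0152      0.0
[06/02/2021-14:54:32] [I]      Total     7273.00         661.1818    100.0
"""

if __name__ == "__main__":
    for layer, avg_ms in slowest_layers(parse_profile(SAMPLE)):
        print(f"{layer}: {avg_ms:.4f} ms")
    # Batch of 4 videos, mean GPU latency from the "GPU Compute" block above.
    print(f"{videos_per_second(4, 648.337):.2f} videos/s")
```

Running the same functions over the full logs from both containers makes it easier to see which convolutions account for the difference, instead of comparing the tables by eye.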
dmenig commented 3 years ago

Here are the results with TensorRT 8. As you can see, the regression is still there :/

I installed TensorRT manually from the file nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.0.3-ea-20210423_1-1_amd64.deb on this Docker image: nvcr.io/nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun  
&&&& RUNNING TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:23:23] [I] === Model Options ===                                                                                                                                                            
[06/23/2021-13:23:23] [I] Format: ONNX                                                                                                                                                                     
[06/23/2021-13:23:23] [I] Model: resnet.onnx                                                                                                                                                               
[06/23/2021-13:23:23] [I] Output:                                                                                                                                                                          
[06/23/2021-13:23:23] [I] === Build Options ===                                                                                                                                                            
[06/23/2021-13:23:23] [I] Max batch: explicit                                                                                                                                                              
[06/23/2021-13:23:23] [I] Workspace: 5000 MiB                                                                                                                                                              
[06/23/2021-13:23:23] [I] minTiming: 1                                                                                                                                                                     
[06/23/2021-13:23:23] [I] avgTiming: 8                                                                                                                                                                     
[06/23/2021-13:23:23] [I] Precision: FP32+FP16+INT8                                                                                                                                                        
[06/23/2021-13:23:23] [I] Calibration: Dynamic                                                                                                                                                             
[06/23/2021-13:23:23] [I] Refit: Disabled                                                                                                                                                                  
[06/23/2021-13:23:23] [I] Sparsity: Disabled                                                                                                                                                               
[06/23/2021-13:23:23] [I] Safe mode: Disabled                                                                                                                                                              
[06/23/2021-13:23:23] [I] Enable serialization: Disabled                                                                                                                                                   
[06/23/2021-13:23:23] [I] Save engine: resnet.trt                                                                                                                                                          
[06/23/2021-13:23:23] [I] Load engine:                                                                                                                                                                     
[06/23/2021-13:23:23] [I] NVTX verbosity: 0                                                                                                                                                                
[06/23/2021-13:23:23] [I] Tactic sources: Using default tactic sources                                                                                                                                     
[06/23/2021-13:23:23] [I] timingCacheMode: local                                                                                                                                                           
[06/23/2021-13:23:23] [I] timingCacheFile:                                                                                                                                                                 
[06/23/2021-13:23:23] [I] Input(s): fp32:chw                                                                                                                                                               
[06/23/2021-13:23:23] [I] Output(s): fp32:chw                                                                                                                                                              
[06/23/2021-13:23:23] [I] Input build shapes: model                                                                                                                                                        
[06/23/2021-13:23:23] [I] Input calibration shapes: model                                                                                                                                                  
[06/23/2021-13:23:23] [I] === System Options ===                                                                                                                                                           
[06/23/2021-13:23:23] [I] Device: 0                                                                                                                                                                        
[06/23/2021-13:23:23] [I] DLACore:                                                                                                                                                                         
[06/23/2021-13:23:23] [I] Plugins:                                                                                                                                                                         
[06/23/2021-13:23:23] [I] === Inference Options ===                                                                                                                                                        
[06/23/2021-13:23:23] [I] Batch: Explicit                                                                                                                                                                  
[06/23/2021-13:23:23] [I] Input inference shapes: model                                                                                                                                                    
[06/23/2021-13:23:23] [I] Iterations: 10                                                                                                                                                                   
[06/23/2021-13:23:23] [I] Duration: 3s (+ 200ms warm up)                                                                                                                                                   
[06/23/2021-13:23:23] [I] Sleep time: 0ms                                                                                                                                                                  
[06/23/2021-13:23:23] [I] Streams: 1                                                                                                                                                                       
[06/23/2021-13:23:23] [I] ExposeDMA: Disabled                                                                                                                                                              
[06/23/2021-13:23:23] [I] Data transfers: Disabled                                                                                                                                                         
[06/23/2021-13:23:23] [I] Spin-wait: Disabled                                                                                                                                                              
[06/23/2021-13:23:23] [I] Multithreading: Disabled                                                                                                                                                         
[06/23/2021-13:23:23] [I] CUDA Graph: Disabled                                                                                                                                                             
[06/23/2021-13:23:23] [I] Separate profiling: Enabled                                                                                                                                                      
[06/23/2021-13:23:23] [I] Time Deserialize: Disabled                                                                                                                                                       
[06/23/2021-13:23:23] [I] Time Refit: Disabled                                                                                                                                                             
[06/23/2021-13:23:23] [I] Skip inference: Disabled                                                                                                                                                         
[06/23/2021-13:23:23] [I] Inputs:                                                                                                                                                                          
[06/23/2021-13:23:23] [I] === Reporting Options ===                                                                                                                                                        
[06/23/2021-13:23:23] [I] Verbose: Disabled                                                                                                                                                                
[06/23/2021-13:23:23] [I] Averages: 10 inferences                                                                                                                                                          
[06/23/2021-13:23:23] [I] Percentile: 99                                                                                                                                                                   
[06/23/2021-13:23:23] [I] Dump refittable layers:Disabled                                                                                                                                                  
[06/23/2021-13:23:23] [I] Dump output: Disabled                                                                                                                                                            
[06/23/2021-13:23:23] [I] Profile: Enabled                                                                                                                                                                 
[06/23/2021-13:23:23] [I] Export timing to JSON file:                                                                                                                                                      
[06/23/2021-13:23:23] [I] Export output to JSON file:                                                                                                                                                      
[06/23/2021-13:23:23] [I] Export profile to JSON file:                                                                                                                                                     
[06/23/2021-13:23:23] [I]                                                                                                                                                                                  
[06/23/2021-13:23:23] [I] === Device Information ===                                                                                                                                                       
[06/23/2021-13:23:23] [I] Selected Device: GeForce GTX 1080 Ti                                                                                                                                             
[06/23/2021-13:23:23] [I] Compute Capability: 6.1                                                                                                                                                          
[06/23/2021-13:23:23] [I] SMs: 28                                                                                                                                                                          
[06/23/2021-13:23:23] [I] Compute Clock Rate: 1.582 GHz                                                                                                                                                    
[06/23/2021-13:23:23] [I] Device Global Memory: 11176 MiB                                                                                                                                                  
[06/23/2021-13:23:23] [I] Shared Memory per SM: 96 KiB                                                                                                                                                     
[06/23/2021-13:23:23] [I] Memory Bus Width: 352 bits (ECC disabled)                                                                                                                                        
[06/23/2021-13:23:23] [I] Memory Clock Rate: 5.505 GHz                                                                                                                                                     
[06/23/2021-13:23:23] [I]                                                                                                                                                                                  
[06/23/2021-13:23:23] [I] TensorRT version: 8000                                                                                                                                                           
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +159, GPU +0, now: CPU 165, GPU 215 (MiB)                                                                                                  
[06/23/2021-13:23:24] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[06/23/2021-13:23:24] [I] [TRT] Input filename:   resnet.onnx                                                                                                                                              
[06/23/2021-13:23:24] [I] [TRT] ONNX IR version:  0.0.6                                                                                                                                                    
[06/23/2021-13:23:24] [I] [TRT] Opset version:    9                                                                                                                                                        
[06/23/2021-13:23:24] [I] [TRT] Producer name:    pytorch                                                                                                                                                  
[06/23/2021-13:23:24] [I] [TRT] Producer version: 1.7                                                                                                                                                      
[06/23/2021-13:23:24] [I] [TRT] Domain:                                                                                                                                                                    
[06/23/2021-13:23:24] [I] [TRT] Model version:    0                                                                                                                                                        
[06/23/2021-13:23:24] [I] [TRT] Doc string:                                                                                                                                                                
[06/23/2021-13:23:24] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 245, GPU 215 (MiB)                                                                                                    
[06/23/2021-13:23:24] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 245 MiB, GPU 215 MiB                                                                                                                 
[06/23/2021-13:23:24] [W] [TRT] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.                                                                  
[06/23/2021-13:23:24] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.                                                                         
[06/23/2021-13:23:24] [W] [TRT] Convolution + generic activation fusion is disable due to incompatible driver or nvrtc                                                                                     
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +233, GPU +94, now: CPU 479, GPU 309 (MiB)                                                                                      
[06/23/2021-13:23:24] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +189, GPU +84, now: CPU 668, GPU 393 (MiB)                                                                                                
[06/23/2021-13:23:24] [W] [TRT] Detected invalid timing cache, setup a local cache instead                                                                                                                 
[06/23/2021-13:24:35] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.                              
[06/23/2021-13:24:40] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                                                                                            
[06/23/2021-13:24:40] [I] [TRT] Total Host Persistent Memory: 1536                                                                                                                                         
[06/23/2021-13:24:40] [I] [TRT] Total Device Persistent Memory: 0                                                                                                                                          
[06/23/2021-13:24:40] [I] [TRT] Total Scratch Memory: 508851200                                                                                                                                            
[06/23/2021-13:24:40] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 4 MiB                                                                                  
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 909, GPU 533 (MiB)                                                                                         
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 909, GPU 541 (MiB)                                                                                                   
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 909, GPU 525 (MiB)                                                                                         
[06/23/2021-13:24:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 908, GPU 507 (MiB)                                                                                         
[06/23/2021-13:24:40] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 908 MiB, GPU 507 MiB                                                                                                                   
[06/23/2021-13:24:41] [I] Engine built in 77.2374 sec.                                                                                                                                                     
[06/23/2021-13:24:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 827, GPU 517 (MiB)                                                                                        
[06/23/2021-13:24:41] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 827, GPU 525 (MiB)                                                                                                   
[06/23/2021-13:24:41] [I] Created input binding for 0 with dimensions 4x3x35x224x224                                                                                                                       
[06/23/2021-13:24:41] [I] Created output binding for 343 with dimensions 4x400                                                                                                                             
[06/23/2021-13:24:41] [I] Starting inference                                                                                                                                                               
[06/23/2021-13:24:48] [I] Warmup completed 1 queries over 200 ms                                                                                                                                           
[06/23/2021-13:24:48] [I] Timing trace has 10 queries over 6.36198 s                                                                                                                                       
[06/23/2021-13:24:48] [I]                                                                                                                                                                                  
[06/23/2021-13:24:48] [I] === Trace details ===                                                                                                                                                            
[06/23/2021-13:24:48] [I] Trace averages of 10 runs:                                                                                                                                                       
[06/23/2021-13:24:48] [I] Average on 10 runs - GPU latency: 636.196 ms - Host latency: 636.196 ms (end to end 636.196 ms, enqueue 2.68777 ms)                                                              
[06/23/2021-13:24:48] [I]                                                                                                                                                                                  
[06/23/2021-13:24:48] [I] === Performance summary ===                                                                                                                                                      
[06/23/2021-13:24:48] [I] Throughput: 1.57184 qps                                                                                                                                                          
[06/23/2021-13:24:48] [I] Latency: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms                                                                
[06/23/2021-13:24:48] [I] End-to-End Host Latency: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms                                                
[06/23/2021-13:24:48] [I] Enqueue Time: min = 2.53569 ms, max = 2.81494 ms, mean = 2.68777 ms, median = 2.74072 ms, percentile(99%) = 2.81494 ms                                                           
[06/23/2021-13:24:48] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms                                                                                          
[06/23/2021-13:24:48] [I] GPU Compute Time: min = 634.644 ms, max = 638.123 ms, mean = 636.196 ms, median = 636.003 ms, percentile(99%) = 638.123 ms                                                       
[06/23/2021-13:24:48] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms                                                                                          
[06/23/2021-13:24:48] [I] Total Host Walltime: 6.36198 s                                                                                                                                                   
[06/23/2021-13:24:48] [I] Total GPU Compute Time: 6.36196 s                                                                                                                                                
[06/23/2021-13:24:48] [I] Explanations of the performance metrics are printed in the verbose logs.                                                                                                         
[06/23/2021-13:24:48] [I]                                                                                                                                                                                  
[06/23/2021-13:24:55] [I]                                                                                                                                                                                  
[06/23/2021-13:24:55] [I] === Profile (11 iterations ) ===                                                                                                                                                 
[06/23/2021-13:24:55] [I]                                                                                   Layer   Time (ms)   Avg. Time (ms)   Time %                                                    
[06/23/2021-13:24:55] [I]                             Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1        7.29           0.6629      0.1                                                    
[06/23/2021-13:24:55] [I]                                                                         Conv_0 + Relu_1      168.48          15.3162      2.4                                                    
[06/23/2021-13:24:55] [I]                                                                         Conv_2 + Relu_3      142.70          12.9725      2.0                                                    
[06/23/2021-13:24:55] [I]                                                                         Conv_4 + Relu_5      618.20          56.1999      8.8                                                    
[06/23/2021-13:24:55] [I]                                                                         Conv_6 + Relu_7      230.13          20.9210      3.3                                                    
[06/23/2021-13:24:55] [I]                                                                         Conv_8 + Relu_9      639.12          58.1021      9.1                                                    
[06/23/2021-13:24:55] [I]                                                              Conv_10 + Add_11 + Relu_12      251.36          22.8513      3.6                                                    
[06/23/2021-13:24:55] [I]                                                                       Conv_13 + Relu_14      632.13          57.4666      9.0                                                    
[06/23/2021-13:24:55] [I]                                                                       Conv_15 + Relu_16      227.75          20.7047      3.3                                                    
[06/23/2021-13:24:55] [I]                                                                       Conv_17 + Relu_18      632.38          57.4888      9.0                                                    
[06/23/2021-13:24:55] [I]                                                              Conv_19 + Add_20 + Relu_21      250.15          22.7405      3.6                                                    
[06/23/2021-13:24:55] [I]                 Reformatting CopyNode for Output Tensor 0 to Conv_19 + Add_20 + Relu_21       29.86           2.7147      0.4
[06/23/2021-13:24:55] [I]                                                                       Conv_22 + Relu_23      340.15          30.9225      4.9
[06/23/2021-13:24:55] [I]                                                                       Conv_24 + Relu_25       78.56           7.1415      1.1
[06/23/2021-13:24:55] [I]                                                                       Conv_26 + Relu_27      308.86          28.0779      4.4
[06/23/2021-13:24:55] [I]                                                                                 Conv_28       71.43           6.4938      1.0
[06/23/2021-13:24:55] [I]                                                              Conv_29 + Add_30 + Relu_31       39.99           3.6358      0.6
[06/23/2021-13:24:55] [I]                                                                       Conv_32 + Relu_33      450.92          40.9929      6.4
[06/23/2021-13:24:55] [I]                                                                       Conv_34 + Relu_35       89.53           8.1391      1.3
[06/23/2021-13:24:55] [I]                                                                       Conv_36 + Relu_37      447.44          40.6764      6.4
[06/23/2021-13:24:55] [I]                                                              Conv_38 + Add_39 + Relu_40       98.98           8.9981      1.4
[06/23/2021-13:24:55] [I]                 Reformatting CopyNode for Output Tensor 0 to Conv_38 + Add_39 + Relu_40        9.82           0.8929      0.1
[06/23/2021-13:24:55] [I]                                                                       Conv_41 + Relu_42      156.82          14.2564      2.2
[06/23/2021-13:24:55] [I]                                                                       Conv_43 + Relu_44       28.42           2.5833      0.4
[06/23/2021-13:24:55] [I]                                                                       Conv_45 + Relu_46      127.12          11.5566      1.8
[06/23/2021-13:24:55] [I]                                                                                 Conv_47       38.15           3.4682      0.5
[06/23/2021-13:24:55] [I]                                                              Conv_48 + Add_49 + Relu_50        9.71           0.8830      0.1
[06/23/2021-13:24:55] [I]                                                                       Conv_51 + Relu_52      185.15          16.8321      2.6
[06/23/2021-13:24:55] [I]                                                                       Conv_53 + Relu_54       46.87           4.2614      0.7
[06/23/2021-13:24:55] [I]                                                                       Conv_55 + Relu_56      183.94          16.7219      2.6
[06/23/2021-13:24:55] [I]                                                              Conv_57 + Add_58 + Relu_59       48.34           4.3944      0.7
[06/23/2021-13:24:55] [I]                           Reformatting CopyNode for Input Tensor 0 to Conv_60 + Relu_61        1.92           0.1742      0.0
[06/23/2021-13:24:55] [I]                                                                       Conv_60 + Relu_61       44.42           4.0381      0.6
[06/23/2021-13:24:55] [I]                           Reformatting CopyNode for Input Tensor 0 to Conv_62 + Relu_63        2.42           0.2203      0.0
[06/23/2021-13:24:55] [I]                                                                       Conv_62 + Relu_63       18.49           1.6812      0.3
[06/23/2021-13:24:55] [I]                                                                       Conv_64 + Relu_65       80.67           7.3335      1.2
[06/23/2021-13:24:55] [I]                                                                                 Conv_66       18.08           1.6433      0.3
[06/23/2021-13:24:55] [I]                                                              Conv_67 + Add_68 + Relu_69        3.78           0.3439      0.1
[06/23/2021-13:24:55] [I]                                                                       Conv_70 + Relu_71       93.57           8.5066      1.3
[06/23/2021-13:24:55] [I]                                                                       Conv_72 + Relu_73       22.16           2.0149      0.3
[06/23/2021-13:24:55] [I]                                                                       Conv_74 + Relu_75       93.51           8.5008      1.3
[06/23/2021-13:24:55] [I]                                                              Conv_76 + Add_77 + Relu_78       22.57           2.0520      0.3
[06/23/2021-13:24:55] [I]                                                                    GlobalAveragePool_79        1.09           0.0995      0.0
[06/23/2021-13:24:55] [I]  Reformatting CopyNode for Input Tensor 0 to Flatten_80 + (Unnamed Layer* 81) [Shuffle]        0.07           0.0061      0.0
[06/23/2021-13:24:55] [I]                                                                                 Gemm_81        0.17           0.0154      0.0
[06/23/2021-13:24:55] [I]                                                                                   Total     6992.69         635.6991    100.0
[06/23/2021-13:24:55] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --best --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:24:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 827, GPU 1919 (MiB)
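
To compare per-layer timings between two container versions, the profile table above can be parsed with a small script. This is only a sketch and not part of trtexec: the regular expression assumes the row layout shown in this log (timestamp, `[I]`, layer name, then three numeric columns), and `parse_profile`, `slowest_layers` and `compare` are hypothetical helper names.

```python
import re

# One row of the "--dumpProfile" table as it appears in this log:
# "[timestamp] [I] <layer name> <Time (ms)> <Avg. Time (ms)> <Time %>"
# The layout is an assumption taken from the output above.
ROW = re.compile(
    r"^\[[^\]]+\]\s+\[I\]\s+(?P<layer>.+?)\s+"
    r"(?P<total>\d+\.\d+)\s+(?P<avg>\d+\.\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)


def parse_profile(log_text):
    """Return {layer name: average time in ms}, skipping the 'Total' row."""
    rows = {}
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match and match.group("layer") != "Total":
            rows[match.group("layer")] = float(match.group("avg"))
    return rows


def slowest_layers(rows, n=5):
    """Return the n layers with the largest average time, slowest first."""
    return sorted(rows.items(), key=lambda item: item[1], reverse=True)[:n]


def compare(old, new):
    """Return {layer: new avg - old avg} for layers present in both profiles."""
    return {name: new[name] - old[name] for name in old if name in new}


if __name__ == "__main__":
    # Three rows copied from the profile above, used as sample input.
    sample = (
        "[06/23/2021-13:24:55] [I]    Conv_4 + Relu_5      618.20          56.1999      8.8\n"
        "[06/23/2021-13:24:55] [I]    Conv_6 + Relu_7      230.13          20.9210      3.3\n"
        "[06/23/2021-13:24:55] [I]    Conv_8 + Relu_9      639.12          58.1021      9.1\n"
    )
    for name, avg_ms in slowest_layers(parse_profile(sample), n=2):
        print(f"{name}: {avg_ms} ms")
```

Running `parse_profile` on the saved logs from both containers and passing the two results to `compare` gives the per-layer difference in average time, which should make it easier to see which convolutions account for the slowdown. Fused layer names can differ between builds, so `compare` only reports layers that appear in both profiles.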

And without --best:

/usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
&&&& RUNNING TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:25:38] [I] === Model Options ===                                                                                                                                                            
[06/23/2021-13:25:38] [I] Format: ONNX                                                                                                                                                                     
[06/23/2021-13:25:38] [I] Model: resnet.onnx                                                                                                                                                               
[06/23/2021-13:25:38] [I] Output:                                                                                                                                                                          
[06/23/2021-13:25:38] [I] === Build Options ===                                                                                                                                                            
[06/23/2021-13:25:38] [I] Max batch: explicit                                                                                                                                                              
[06/23/2021-13:25:38] [I] Workspace: 5000 MiB                                                                                                                                                              
[06/23/2021-13:25:38] [I] minTiming: 1                                                                                                                                                                     
[06/23/2021-13:25:38] [I] avgTiming: 8                                                                                                                                                                     
[06/23/2021-13:25:38] [I] Precision: FP32                                                                                                                                                                  
[06/23/2021-13:25:38] [I] Calibration:                                                                                                                                                                     
[06/23/2021-13:25:38] [I] Refit: Disabled                                                                                                                                                                  
[06/23/2021-13:25:38] [I] Sparsity: Disabled                                                                                                                                                               
[06/23/2021-13:25:38] [I] Safe mode: Disabled                                                                                                                                                              
[06/23/2021-13:25:38] [I] Enable serialization: Disabled                                                                                                                                                   
[06/23/2021-13:25:38] [I] Save engine: resnet.trt                                                                                                                                                          
[06/23/2021-13:25:38] [I] Load engine:                                                                                                                                                                     
[06/23/2021-13:25:38] [I] NVTX verbosity: 0                                                                                                                                                                
[06/23/2021-13:25:38] [I] Tactic sources: Using default tactic sources                                                                                                                                     
[06/23/2021-13:25:38] [I] timingCacheMode: local                                                                                                                                                           
[06/23/2021-13:25:38] [I] timingCacheFile:                                                                                                                                                                 
[06/23/2021-13:25:38] [I] Input(s): fp32:chw                                                                                                                                                               
[06/23/2021-13:25:38] [I] Output(s): fp32:chw                                                                                                                                                              
[06/23/2021-13:25:38] [I] Input build shapes: model                                                                                                                                                        
[06/23/2021-13:25:38] [I] Input calibration shapes: model                                                                                                                                                  
[06/23/2021-13:25:38] [I] === System Options ===                                                                                                                                                           
[06/23/2021-13:25:38] [I] Device: 0                                                                                                                                                                        
[06/23/2021-13:25:38] [I] DLACore:                                                                                                                                                                         
[06/23/2021-13:25:38] [I] Plugins:                                                                                                                                                                         
[06/23/2021-13:25:38] [I] === Inference Options ===                                                                                                                                                        
[06/23/2021-13:25:38] [I] Batch: Explicit                                                                                                                                                                  
[06/23/2021-13:25:38] [I] Input inference shapes: model                                                                                                                                                    
[06/23/2021-13:25:38] [I] Iterations: 10                                                                                                                                                                   
[06/23/2021-13:25:38] [I] Duration: 3s (+ 200ms warm up)                                                                                                                                                   
[06/23/2021-13:25:38] [I] Sleep time: 0ms                                                                                                                                                                  
[06/23/2021-13:25:38] [I] Streams: 1                                                                                                                                                                       
[06/23/2021-13:25:38] [I] ExposeDMA: Disabled                                                                                                                                                              
[06/23/2021-13:25:38] [I] Data transfers: Disabled                                                                                                                                                         
[06/23/2021-13:25:38] [I] Spin-wait: Disabled                                                                                                                                                              
[06/23/2021-13:25:38] [I] Multithreading: Disabled                                                                                                                                                         
[06/23/2021-13:25:38] [I] CUDA Graph: Disabled                                                                                                                                                             
[06/23/2021-13:25:38] [I] Separate profiling: Enabled                                                                                                                                                      
[06/23/2021-13:25:38] [I] Time Deserialize: Disabled                                                                                                                                                       
[06/23/2021-13:25:38] [I] Time Refit: Disabled                                                                                                                                                             
[06/23/2021-13:25:38] [I] Skip inference: Disabled                                                                                                                                                         
[06/23/2021-13:25:38] [I] Inputs:                                                                                                                                                                          
[06/23/2021-13:25:38] [I] === Reporting Options ===                                                                                                                                                        
[06/23/2021-13:25:38] [I] Verbose: Disabled                                                                                                                                                                
[06/23/2021-13:25:38] [I] Averages: 10 inferences                                                                                                                                                          
[06/23/2021-13:25:38] [I] Percentile: 99                                                                                                                                                                   
[06/23/2021-13:25:38] [I] Dump refittable layers:Disabled                                                                                                                                                  
[06/23/2021-13:25:38] [I] Dump output: Disabled                                                                                                                                                            
[06/23/2021-13:25:38] [I] Profile: Enabled                                                                                                                                                                 
[06/23/2021-13:25:38] [I] Export timing to JSON file:                                                                                                                                                      
[06/23/2021-13:25:38] [I] Export output to JSON file:                                                                                                                                                      
[06/23/2021-13:25:38] [I] Export profile to JSON file:                                                                                                                                                     
[06/23/2021-13:25:38] [I]                                                                                                                                                                                  
[06/23/2021-13:25:38] [I] === Device Information ===                                                                                                                                                       
[06/23/2021-13:25:38] [I] Selected Device: GeForce GTX 1080 Ti                                                                                                                                             
[06/23/2021-13:25:38] [I] Compute Capability: 6.1                                                                                                                                                          
[06/23/2021-13:25:38] [I] SMs: 28                                                                                                                                                                          
[06/23/2021-13:25:38] [I] Compute Clock Rate: 1.582 GHz                                                                                                                                                    
[06/23/2021-13:25:38] [I] Device Global Memory: 11176 MiB                                                                                                                                                  
[06/23/2021-13:25:38] [I] Shared Memory per SM: 96 KiB                                                                                                                                                     
[06/23/2021-13:25:38] [I] Memory Bus Width: 352 bits (ECC disabled)                                                                                                                                        
[06/23/2021-13:25:38] [I] Memory Clock Rate: 5.505 GHz                                                                                                                                                     
[06/23/2021-13:25:38] [I]                                                                                                                                                                                  
[06/23/2021-13:25:38] [I] TensorRT version: 8000                                                                                                                                                           
[06/23/2021-13:25:38] [I] [TRT] [MemUsageChange] Init CUDA: CPU +159, GPU +0, now: CPU 165, GPU 215 (MiB)                                                                                                  
[06/23/2021-13:25:38] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[06/23/2021-13:25:38] [I] [TRT] Input filename:   resnet.onnx                                                                                                                                              
[06/23/2021-13:25:38] [I] [TRT] ONNX IR version:  0.0.6                                                                                                                                                    
[06/23/2021-13:25:38] [I] [TRT] Opset version:    9                                                                                                                                                        
[06/23/2021-13:25:38] [I] [TRT] Producer name:    pytorch                                                                                                                                                  
[06/23/2021-13:25:38] [I] [TRT] Producer version: 1.7                                                                                                                                                      
[06/23/2021-13:25:38] [I] [TRT] Domain:                                                                                                                                                                    
[06/23/2021-13:25:38] [I] [TRT] Model version:    0                                                                                                                                                        
[06/23/2021-13:25:38] [I] [TRT] Doc string:                                                                                                                                                                
[06/23/2021-13:25:38] [I] [TRT] ----------------------------------------------------------------                                                                                                           
[06/23/2021-13:25:38] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 245, GPU 215 (MiB)                                                                                                    
[06/23/2021-13:25:38] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 245 MiB, GPU 215 MiB                                                                                                                 
[06/23/2021-13:25:38] [W] [TRT] Convolution + generic activation fusion is disable due to incompatible driver or nvrtc                                                                                     
[06/23/2021-13:25:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +234, GPU +94, now: CPU 479, GPU 309 (MiB)                                                                                      
[06/23/2021-13:25:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +188, GPU +84, now: CPU 667, GPU 393 (MiB)                                                                                                
[06/23/2021-13:25:39] [W] [TRT] Detected invalid timing cache, setup a local cache instead                                                                                                                 
[06/23/2021-13:26:14] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.                              
[06/23/2021-13:26:16] [I] [TRT] Detected 1 inputs and 1 output network tensors.                                                                                                                            
[06/23/2021-13:26:16] [I] [TRT] Total Host Persistent Memory: 1536                                                                                                                                         
[06/23/2021-13:26:16] [I] [TRT] Total Device Persistent Memory: 0                                                                                                                                          
[06/23/2021-13:26:16] [I] [TRT] Total Scratch Memory: 1014681600                                                                                                                                           
[06/23/2021-13:26:16] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB                                                                                   
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 855, GPU 587 (MiB)                                                                                         
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 855, GPU 595 (MiB)                                                                                                   
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 855, GPU 579 (MiB)                                                                                         
[06/23/2021-13:26:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 855, GPU 561 (MiB)                                                                                         
[06/23/2021-13:26:16] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 855 MiB, GPU 561 MiB                                                                                                                   
[06/23/2021-13:26:17] [I] Engine built in 38.7099 sec.                                                                                                                                                     
[06/23/2021-13:26:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 774, GPU 571 (MiB)                                                                                        
[06/23/2021-13:26:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 775, GPU 579 (MiB)                                                                                                   
[06/23/2021-13:26:17] [I] Created input binding for 0 with dimensions 4x3x35x224x224                                                                                                                       
[06/23/2021-13:26:17] [I] Created output binding for 343 with dimensions 4x400                                                                                                                             
[06/23/2021-13:26:17] [I] Starting inference                                                                                                                                                               
[06/23/2021-13:26:25] [I] Warmup completed 1 queries over 200 ms                                                                                                                                           
[06/23/2021-13:26:25] [I] Timing trace has 10 queries over 7.64667 s                                                                                                                                       
[06/23/2021-13:26:25] [I]                                                                                                                                                                                  
[06/23/2021-13:26:25] [I] === Trace details ===                                                                                                                                                            
[06/23/2021-13:26:25] [I] Trace averages of 10 runs:                                                                                                                                                       
[06/23/2021-13:26:25] [I] Average on 10 runs - GPU latency: 764.665 ms - Host latency: 764.665 ms (end to end 764.665 ms, enqueue 1.9593 ms)                                                               
[06/23/2021-13:26:25] [I]                                                                                                                                                                                  
[06/23/2021-13:26:25] [I] === Performance summary ===                                                                                                                                                      
[06/23/2021-13:26:25] [I] Throughput: 1.30776 qps                                                                                                                                                          
[06/23/2021-13:26:25] [I] Latency: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms                                                                
[06/23/2021-13:26:25] [I] End-to-End Host Latency: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms                                                
[06/23/2021-13:26:25] [I] Enqueue Time: min = 1.78426 ms, max = 2.03662 ms, mean = 1.9593 ms, median = 1.98743 ms, percentile(99%) = 2.03662 ms                                                            
[06/23/2021-13:26:25] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms                                                                                          
[06/23/2021-13:26:25] [I] GPU Compute Time: min = 759.987 ms, max = 767.528 ms, mean = 764.665 ms, median = 764.687 ms, percentile(99%) = 767.528 ms                                                       
[06/23/2021-13:26:25] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms                                                                                          
[06/23/2021-13:26:25] [I] Total Host Walltime: 7.64667 s                                                                                                                                                   
[06/23/2021-13:26:25] [I] Total GPU Compute Time: 7.64665 s                                                                                                                                                
[06/23/2021-13:26:25] [I] Explanations of the performance metrics are printed in the verbose logs.                                                                                                         
[06/23/2021-13:26:25] [I]                                                                                                                                                                                  
[06/23/2021-13:26:34] [I]                                                                                                                                                                                  
[06/23/2021-13:26:34] [I] === Profile (11 iterations ) ===                                                                                                                                                 
[06/23/2021-13:26:34] [I]                       Layer   Time (ms)   Avg. Time (ms)   Time %                                                                                                                
[06/23/2021-13:26:34] [I]             Conv_0 + Relu_1      179.34          16.3033      2.1                                                                                                                
[06/23/2021-13:26:34] [I]             Conv_2 + Relu_3      208.35          18.9407      2.5                                                                                                                
[06/23/2021-13:26:34] [I]             Conv_4 + Relu_5      662.98          60.2711      7.9                                                                                                                
[06/23/2021-13:26:34] [I]             Conv_6 + Relu_7      496.22          45.1105      5.9                                                                                                                
[06/23/2021-13:26:34] [I]             Conv_8 + Relu_9      668.58          60.7802      7.9                                                                                                                
[06/23/2021-13:26:34] [I]  Conv_10 + Add_11 + Relu_12      533.84          48.5308      6.3
[06/23/2021-13:26:34] [I]           Conv_13 + Relu_14      663.27          60.2971      7.9
[06/23/2021-13:26:34] [I]           Conv_15 + Relu_16      494.74          44.9765      5.9
[06/23/2021-13:26:34] [I]           Conv_17 + Relu_18      667.25          60.6595      7.9
[06/23/2021-13:26:34] [I]  Conv_19 + Add_20 + Relu_21      531.30          48.2997      6.3
[06/23/2021-13:26:34] [I]           Conv_22 + Relu_23      337.54          30.6856      4.0
[06/23/2021-13:26:34] [I]           Conv_24 + Relu_25       78.31           7.1188      0.9
[06/23/2021-13:26:34] [I]           Conv_26 + Relu_27      305.24          27.7490      3.6
[06/23/2021-13:26:34] [I]                     Conv_28       71.01           6.4558      0.8
[06/23/2021-13:26:34] [I]  Conv_29 + Add_30 + Relu_31       39.64           3.6039      0.5
[06/23/2021-13:26:34] [I]           Conv_32 + Relu_33      446.94          40.6308      5.3
[06/23/2021-13:26:34] [I]           Conv_34 + Relu_35       88.57           8.0518      1.1
[06/23/2021-13:26:34] [I]           Conv_36 + Relu_37      443.04          40.2766      5.3
[06/23/2021-13:26:34] [I]  Conv_38 + Add_39 + Relu_40       99.77           9.0698      1.2
[06/23/2021-13:26:34] [I]           Conv_41 + Relu_42      239.33          21.7572      2.8
[06/23/2021-13:26:34] [I]           Conv_43 + Relu_44       30.69           2.7897      0.4
[06/23/2021-13:26:34] [I]           Conv_45 + Relu_46      128.52          11.6840      1.5
[06/23/2021-13:26:34] [I]                     Conv_47       46.40           4.2186      0.6
[06/23/2021-13:26:34] [I]  Conv_48 + Add_49 + Relu_50       17.43           1.5845      0.2
[06/23/2021-13:26:34] [I]           Conv_51 + Relu_52      187.69          17.0626      2.2
[06/23/2021-13:26:34] [I]           Conv_53 + Relu_54       59.29           5.3900      0.7
[06/23/2021-13:26:34] [I]           Conv_55 + Relu_56      188.16          17.1051      2.2
[06/23/2021-13:26:34] [I]  Conv_57 + Add_58 + Relu_59       62.02           5.6380      0.7
[06/23/2021-13:26:34] [I]           Conv_60 + Relu_61       45.43           4.1304      0.5
[06/23/2021-13:26:34] [I]           Conv_62 + Relu_63       30.29           2.7533      0.4
[06/23/2021-13:26:34] [I]           Conv_64 + Relu_65       80.24           7.2947      1.0
[06/23/2021-13:26:34] [I]                     Conv_66       30.02           2.7289      0.4
[06/23/2021-13:26:34] [I]  Conv_67 + Add_68 + Relu_69        5.18           0.4705      0.1
[06/23/2021-13:26:34] [I]           Conv_70 + Relu_71       93.26           8.4783      1.1
[06/23/2021-13:26:34] [I]           Conv_72 + Relu_73       36.31           3.3012      0.4
[06/23/2021-13:26:34] [I]           Conv_74 + Relu_75       93.49           8.4995      1.1
[06/23/2021-13:26:34] [I]  Conv_76 + Add_77 + Relu_78       37.64           3.4216      0.4
[06/23/2021-13:26:34] [I]        GlobalAveragePool_79        1.58           0.1435      0.0
[06/23/2021-13:26:34] [I]                     Gemm_81        0.18           0.0160      0.0
[06/23/2021-13:26:34] [I]                       Total     8429.07         766.2787    100.0
[06/23/2021-13:26:34] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8000] # /usr/src/tensorrt/bin/trtexec --onnx=resnet.onnx --workspace=5000 --saveEngine=resnet.trt --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --noDataTransfers --dumpProfile --separateProfileRun
[06/23/2021-13:26:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 774, GPU 3463 (MiB)

dmenig commented 3 years ago

Will that help you figure out this performance regression?
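For anyone wanting to compare two such logs layer by layer, here is a minimal sketch. It assumes both logs were produced with `--dumpProfile` and use the same table layout as the one above; the function names `parse_profile` and `compare_profiles` are hypothetical helpers, not part of trtexec or TensorRT.

```python
import re

# One row of the "=== Profile ===" table:
# "[timestamp] [I]   <layer name>   <total ms>   <avg ms>   <time %>"
ROW = re.compile(
    r"^\[[^\]]+\]\s+\[I\]\s+(?P<layer>.+?)\s+"
    r"(?P<total>\d+(?:\.\d+)?)\s+(?P<avg>\d+(?:\.\d+)?)\s+(?P<pct>\d+(?:\.\d+)?)\s*$"
)


def parse_profile(log_text):
    """Return {layer name: average time in ms} from a trtexec profile dump."""
    layers = {}
    in_profile = False
    for line in log_text.splitlines():
        if "=== Profile" in line:
            in_profile = True  # rows only appear after this marker
            continue
        if not in_profile:
            continue
        match = ROW.match(line)
        if match is None:
            continue  # header row, blank log lines, etc.
        name = match.group("layer")
        if name == "Total":
            break  # summary row closes the table
        layers[name] = float(match.group("avg"))
    return layers


def compare_profiles(old, new):
    """Rows of (layer, old ms, new ms, delta ms) for shared layers, largest slowdown first."""
    rows = [
        (name, old[name], new[name], new[name] - old[name])
        for name in old
        if name in new
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Calling `compare_profiles(parse_profile(log_a), parse_profile(log_b))` on the text of two logs lists the layers whose average time grew the most first. Layers are matched by name only, so if the two TensorRT versions fuse layers differently, the unmatched ones are simply skipped.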

dmenig commented 3 years ago

I wonder if this is related to this. It's the same GPU, and the performance drop is comparable.

If it is, that would mean cuDNN 8.2.1 solves the issue. See https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-821, in particular the line "Known regressions on certain layers in cuDNN 8 regression in algorithm selection heuristics have been fixed on Volta and Pascal platforms."

Since it's in the 21.06 image just released, I'll take a look.

ttyio commented 3 years ago

@hyperfraise thanks for sharing, and sorry for the delayed response. I have created an internal issue to track this regression.

dmenig commented 3 years ago

I think I was right. I tested 21.06, which has cuDNN 8.2.1, and the problem seems solved:

21.06 : 535.982 ms

20.11 : 563.797 ms

(so there's even a little boost)
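To put a number on that boost, here is a small sketch of the arithmetic. It assumes the two figures above are latencies measured with the same engine settings (the post does not say which trtexec metric they are), and `relative_change` is just an illustrative helper.

```python
def relative_change(baseline_ms, candidate_ms):
    """Fractional latency change of candidate vs. baseline (negative means faster)."""
    return (candidate_ms - baseline_ms) / baseline_ms


latency_20_11 = 563.797  # ms, from the 20.11 run
latency_21_06 = 535.982  # ms, from the 21.06 run

change = relative_change(latency_20_11, latency_21_06)
print(f"21.06 vs 20.11: {change:+.1%}")  # prints: 21.06 vs 20.11: -4.9%
```

So under that assumption, 21.06 is roughly 5% faster than 20.11 on this model.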

Thus closing this.