NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Significant latency difference between bilinear and nearest GridSample in the same ONNX model #4246

Open DaZhUUU opened 2 days ago

DaZhUUU commented 2 days ago

Environment

TensorRT Version: 8.6.2

NVIDIA GPU: Orin

NVIDIA Driver Version:

CUDA Version: 12.2

CUDNN Version: 8904

Description

I have an ONNX model that contains several GridSample operators. I build the engine and test performance on Orin with the /usr/src/tensorrt/bin/trtexec tool. The command looks like this:

/usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --onnx=model.onnx --saveEngine=model.trt --fp16

Here is my problem:

1) When I set the GridSample attribute 'mode' to 'bilinear', the latency is 1900+ ms. (I know something must be wrong.)
2) When I set 'mode' to 0, the latency is 680 ms, which is normal.

I know I shouldn't set mode to 0, but in the code below I see that if 'mode' is not 'bilinear', 'nearest', or 'bicubic', 'interpolationMode' falls back to 'kNEAREST'.

https://github.com/onnx/onnx-tensorrt/blob/7583da4c62475e84b7be31f4b8fb0c101873d434/builtin_op_importers.cpp#L4386

### So why is the latency so significantly different in these two scenarios?
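To compare the two modes in isolation, a single-node GridSample model can be built directly with the onnx helper API and fed to trtexec. This is a minimal sketch, not the original model; the shapes are illustrative assumptions loosely based on the bindings above (feature map sized to the largest level, grid matching the ref_2d binding):

```python
# Minimal sketch: build a single-node ONNX GridSample graph so the two
# modes can be benchmarked in isolation with trtexec.
# Shapes are illustrative assumptions, not the author's real model.
import onnx
from onnx import TensorProto, helper

def make_gridsample_model(mode: str, path: str):
    # X: (N, C, H, W) feature map; grid: (N, H_out, W_out, 2) sampling locations
    x = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 256, 116, 200])
    grid = helper.make_tensor_value_info("grid", TensorProto.FLOAT, [1, 40000, 1, 2])
    y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 256, 40000, 1])

    node = helper.make_node(
        "GridSample", ["X", "grid"], ["Y"],
        mode=mode,                 # "bilinear" or "nearest"
        padding_mode="zeros",
        align_corners=0,
    )
    graph = helper.make_graph([node], "gridsample_test", [x, grid], [y])
    model = helper.make_model(
        graph, opset_imports=[helper.make_opsetid("", 16)]  # GridSample needs opset >= 16
    )
    onnx.checker.check_model(model)
    onnx.save(model, path)

make_gridsample_model("bilinear", "gs_bilinear.onnx")
make_gridsample_model("nearest", "gs_nearest.onnx")
# Then compare:
#   trtexec --onnx=gs_bilinear.onnx --fp16 --useSpinWait --useCudaGraph
#   trtexec --onnx=gs_nearest.onnx  --fp16 --useSpinWait --useCudaGraph
```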

Also, when I try to profile the model with Nsight Compute, an error occurs halfway through execution.

sudo -E /opt/nvidia/nsight-compute/2023.2.2/ncu --set full -f -o profile_6 /usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --loadEngine=model.trt --fp16
....
....
==PROF== Profiling "copyVectorizedKernel" - 479: 0%....50%....100% - 34 passes
==PROF== Profiling "generatedNativePointwise" - 480: NVMAP_IOC_GET_FD failed: Bad address
PosixMemMap:74 FD from Handle failed : Bad address
NVMAP_IOC_GET_FD failed: Bad address
0%....50%....100% - 34 passes
==PROF== Profiling "__myl_bb0_23_ConSli" - 481: 0%....50%....
==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "__myl_bb0_23_ConSli" in process 25435
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.

I downloaded newer Nsight Compute versions (2023.3.0 and 2024.3.2) from the site below, but errors also occur and profiling fails to start at all. https://developer.nvidia.com/tools-downloads#?dn=nsight-compute-2024-3-2

sudo -E /usr/local/NVIDIA-Nsight-Compute-2023.3/ncu --set full -f -o profile_6_new /usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --loadEngine=model.trt --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --loadEngine=model.trt --fp16
[08/15/2024-17:32:41] [I] === Model Options ===
[08/15/2024-17:32:41] [I] Format: *
[08/15/2024-17:32:41] [I] Model:
[08/15/2024-17:32:41] [I] Output:
[08/15/2024-17:32:41] [I] === Build Options ===
[08/15/2024-17:32:41] [I] Max batch: 1
[08/15/2024-17:32:41] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/15/2024-17:32:41] [I] minTiming: 1
[08/15/2024-17:32:41] [I] avgTiming: 8
[08/15/2024-17:32:41] [I] Precision: FP32+FP16
[08/15/2024-17:32:41] [I] LayerPrecisions:
[08/15/2024-17:32:41] [I] Layer Device Types:
[08/15/2024-17:32:41] [I] Calibration:
[08/15/2024-17:32:41] [I] Refit: Disabled
[08/15/2024-17:32:41] [I] Version Compatible: Disabled
[08/15/2024-17:32:41] [I] ONNX Native InstanceNorm: Disabled
[08/15/2024-17:32:41] [I] TensorRT runtime: full
[08/15/2024-17:32:41] [I] Lean DLL Path:
[08/15/2024-17:32:41] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/15/2024-17:32:41] [I] Exclude Lean Runtime: Disabled
[08/15/2024-17:32:41] [I] Sparsity: Disabled
[08/15/2024-17:32:41] [I] Safe mode: Disabled
[08/15/2024-17:32:41] [I] Build DLA standalone loadable: Disabled
[08/15/2024-17:32:41] [I] Allow GPU fallback for DLA: Disabled
[08/15/2024-17:32:41] [I] DirectIO mode: Disabled
[08/15/2024-17:32:41] [I] Restricted mode: Disabled
[08/15/2024-17:32:41] [I] Skip inference: Disabled
[08/15/2024-17:32:41] [I] Save engine:
[08/15/2024-17:32:41] [I] Load engine: model.trt
[08/15/2024-17:32:41] [I] Profiling verbosity: 0
[08/15/2024-17:32:41] [I] Tactic sources: Using default tactic sources
[08/15/2024-17:32:41] [I] timingCacheMode: local
[08/15/2024-17:32:41] [I] timingCacheFile:
[08/15/2024-17:32:41] [I] Heuristic: Disabled
[08/15/2024-17:32:41] [I] Preview Features: Use default preview flags.
[08/15/2024-17:32:41] [I] MaxAuxStreams: -1
[08/15/2024-17:32:41] [I] BuilderOptimizationLevel: -1
[08/15/2024-17:32:41] [I] Input(s)s format: fp32:CHW
[08/15/2024-17:32:41] [I] Output(s)s format: fp32:CHW
[08/15/2024-17:32:41] [I] Input build shapes: model
[08/15/2024-17:32:41] [I] Input calibration shapes: model
[08/15/2024-17:32:41] [I] === System Options ===
[08/15/2024-17:32:41] [I] Device: 0
[08/15/2024-17:32:41] [I] DLACore:
[08/15/2024-17:32:41] [I] Plugins:
[08/15/2024-17:32:41] [I] setPluginsToSerialize:
[08/15/2024-17:32:41] [I] dynamicPlugins:
[08/15/2024-17:32:41] [I] ignoreParsedPluginLibs: 0
[08/15/2024-17:32:41] [I]
[08/15/2024-17:32:41] [I] === Inference Options ===
[08/15/2024-17:32:41] [I] Batch: 1
[08/15/2024-17:32:41] [I] Input inference shapes: model
[08/15/2024-17:32:41] [I] Iterations: 10
[08/15/2024-17:32:41] [I] Duration: 3s (+ 200ms warm up)
[08/15/2024-17:32:41] [I] Sleep time: 0ms
[08/15/2024-17:32:41] [I] Idle time: 0ms
[08/15/2024-17:32:41] [I] Inference Streams: 1
[08/15/2024-17:32:41] [I] ExposeDMA: Disabled
[08/15/2024-17:32:41] [I] Data transfers: Enabled
[08/15/2024-17:32:41] [I] Spin-wait: Enabled
[08/15/2024-17:32:41] [I] Multithreading: Disabled
[08/15/2024-17:32:41] [I] CUDA Graph: Enabled
[08/15/2024-17:32:41] [I] Separate profiling: Disabled
[08/15/2024-17:32:41] [I] Time Deserialize: Disabled
[08/15/2024-17:32:41] [I] Time Refit: Disabled
[08/15/2024-17:32:41] [I] NVTX verbosity: 0
[08/15/2024-17:32:41] [I] Persistent Cache Ratio: 0
[08/15/2024-17:32:41] [I] Inputs:
[08/15/2024-17:32:41] [I] === Reporting Options ===
[08/15/2024-17:32:41] [I] Verbose: Disabled
[08/15/2024-17:32:41] [I] Averages: 10 inferences
[08/15/2024-17:32:41] [I] Percentiles: 90,95,99
[08/15/2024-17:32:41] [I] Dump refittable layers:Disabled
[08/15/2024-17:32:41] [I] Dump output: Disabled
[08/15/2024-17:32:41] [I] Profile: Disabled
[08/15/2024-17:32:41] [I] Export timing to JSON file:
[08/15/2024-17:32:41] [I] Export output to JSON file:
[08/15/2024-17:32:41] [I] Export profile to JSON file:
[08/15/2024-17:32:41] [I]
==PROF== Connected to process 43858 (/usr/src/tensorrt/bin/trtexec)
[08/15/2024-17:32:41] [I] === Device Information ===
[08/15/2024-17:32:41] [I] Selected Device: Orin
[08/15/2024-17:32:41] [I] Compute Capability: 8.7
[08/15/2024-17:32:41] [I] SMs: 16
[08/15/2024-17:32:41] [I] Device Global Memory: 30697 MiB
[08/15/2024-17:32:41] [I] Shared Memory per SM: 164 KiB
[08/15/2024-17:32:41] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/15/2024-17:32:41] [I] Application Compute Clock Rate: 1.3 GHz
[08/15/2024-17:32:41] [I] Application Memory Clock Rate: 1.3 GHz
[08/15/2024-17:32:41] [I]
[08/15/2024-17:32:41] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/15/2024-17:32:41] [I]
[08/15/2024-17:32:41] [I] TensorRT version: 8.6.2
[08/15/2024-17:32:41] [I] Loading standard plugins
[08/15/2024-17:32:41] [I] Engine loaded in 0.09046 sec.
[08/15/2024-17:32:41] [I] [TRT] Loaded engine size: 146 MiB
[08/15/2024-17:32:41] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +143, now: CPU 0, GPU 143 (MiB)
[08/15/2024-17:32:41] [I] Engine deserialized in 0.0713967 sec.
[08/15/2024-17:32:41] [I] [TRT] [MS] Running engine with multi stream info
[08/15/2024-17:32:41] [I] [TRT] [MS] Number of aux streams is 7
[08/15/2024-17:32:41] [I] [TRT] [MS] Number of total worker streams is 8
[08/15/2024-17:32:41] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[08/15/2024-17:32:43] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +7760, now: CPU 0, GPU 7903 (MiB)
[08/15/2024-17:32:43] [I] Setting persistentCacheLimit to 0 bytes.
[08/15/2024-17:32:43] [I] Using random values for input img
[08/15/2024-17:32:43] [I] Input binding for img with dimensions 6x3x928x1600 is created.
[08/15/2024-17:32:43] [I] Using random values for input ref_2d
[08/15/2024-17:32:43] [I] Input binding for ref_2d with dimensions 1x40000x1x2 is created.
[08/15/2024-17:32:43] [I] Using random values for input reference_points_cam
[08/15/2024-17:32:43] [I] Input binding for reference_points_cam with dimensions 6x1x40000x4x2 is created.
[08/15/2024-17:32:43] [I] Using random values for input shift
[08/15/2024-17:32:43] [I] Input binding for shift with dimensions 1x2 is created.
[08/15/2024-17:32:43] [I] Using random values for input can_bus
[08/15/2024-17:32:43] [I] Input binding for can_bus with dimensions 1x18 is created.
[08/15/2024-17:32:43] [I] Output binding for bev_embed with dimensions 1x40000x256 is created.
[08/15/2024-17:32:43] [I] Output binding for pred_logits with dimensions 6x1x901x10 is created.
[08/15/2024-17:32:43] [I] Output binding for pred_boxes with dimensions 6x1x901x10 is created.
[08/15/2024-17:32:43] [I] Output binding for pred_past_trajs with dimensions 6x1x901x8x2 is created.
[08/15/2024-17:32:43] [I] Output binding for ref_pts with dimensions 1x901x3 is created.
[08/15/2024-17:32:43] [I] Starting inference
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "permutationKernelPLC3" in process 43858
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
sudo -E /usr/local/NVIDIA-Nsight-Compute-2024.3/ncu --set full -f -o profile_6_new /usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --loadEngine=model.trt --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --useSpinWait --useCudaGraph --loadEngine=model.trt --fp16
[... trtexec startup options banner identical to the previous run omitted ...]
==PROF== Connected to process 44575 (/usr/src/tensorrt/bin/trtexec)
==ERROR== The application returned an error code (11).
DaZhUUU commented 1 day ago

I tested an ONNX fragment of the model, and indeed bilinear performs worse: approximately 200 ms, versus approximately 75 ms for nearest.

The tested fragment comes from the multi_scale_deformable_attn_pytorch function in the following code: https://github.com/facebookresearch/sapiens/blob/3e829ac27476e4a70b6a01f85e487492afe02df1/cv/mmcv/ops/multi_scale_deform_attn.py#L114
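For context, the hot op in that function is a per-level F.grid_sample call. A condensed sketch of that loop (paraphrased from the linked mmcv code, not verbatim):

```python
# Condensed sketch of the per-level sampling loop in the linked
# multi_scale_deformable_attn_pytorch (paraphrased, not verbatim).
import torch
import torch.nn.functional as F

def per_level_sampling(value_list, sampling_grids, value_spatial_shapes,
                       bs, num_heads, embed_dims):
    sampled = []
    for level, (H, W) in enumerate(value_spatial_shapes):
        # (bs, H*W, num_heads, dims) -> (bs*num_heads, dims, H, W)
        value_l = (value_list[level].flatten(2).transpose(1, 2)
                   .reshape(bs * num_heads, embed_dims, H, W))
        # (bs, queries, heads, levels, points, 2) -> (bs*heads, queries, points, 2)
        grid_l = sampling_grids[:, :, :, level].transpose(1, 2).flatten(0, 1)
        # This is the GridSample op whose 'mode' drives the latency gap
        sampled.append(F.grid_sample(value_l, grid_l, mode='bilinear',
                                     padding_mode='zeros', align_corners=False))
    return sampled
```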

lix19937 commented 19 hours ago

Obviously, the algorithmic complexity of nearest is lower than that of bilinear.

DaZhUUU commented 19 hours ago

> Obviously, the algorithmic complexity of nearest is lower than that of bilinear.

I know. But the gap is too large, which is abnormal.

lix19937 commented 19 hours ago

Can you upload the two subgraph ONNX files? (grid_sample with bilinear + grid_sample with nearest)

DaZhUUU commented 19 hours ago

> Can you upload the two subgraph ONNX files? (grid_sample with bilinear + grid_sample with nearest)

I don't know why my image uploads keep failing. It may be an issue with my internet connection.

The model is just like that code: some Reshape, some Gather. The value_spatial_shapes is [[116, 200], [58, 100], [29, 50], [15, 25]].

https://github.com/facebookresearch/sapiens/blob/3e829ac27476e4a70b6a01f85e487492afe02df1/cv/mmcv/ops/multi_scale_deform_attn.py#L114
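Since the image uploads fail, a minimal export sketch may serve instead. This is an assumption of the setup, using the stated value_spatial_shapes; the batch, head, and channel sizes are illustrative, not the real model's:

```python
# Minimal sketch: export one grid_sample subgraph per mode for comparison.
# Uses the stated value_spatial_shapes; other shapes are assumed.
import torch
import torch.nn.functional as F

class GridSampleBlock(torch.nn.Module):
    def __init__(self, mode: str):
        super().__init__()
        self.mode = mode

    def forward(self, value, grid):
        return F.grid_sample(value, grid, mode=self.mode,
                             padding_mode='zeros', align_corners=False)

value_spatial_shapes = [[116, 200], [58, 100], [29, 50], [15, 25]]
H, W = value_spatial_shapes[0]                # largest level
value = torch.randn(8, 32, H, W)              # (bs*heads, dims, H, W) -- assumed sizes
grid = torch.rand(8, 40000, 8, 2) * 2 - 1     # sampling grid, normalized to [-1, 1]

for mode in ("bilinear", "nearest"):
    torch.onnx.export(GridSampleBlock(mode), (value, grid),
                      f"grid_sample_{mode}.onnx", opset_version=16,
                      input_names=["value", "grid"], output_names=["out"])
```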