NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

`BatchedNMSPlugin::getOutputDimensions()` reports wrong `num_detections` shape #1880

Closed maminus closed 2 years ago

maminus commented 2 years ago

Description

According to the specification, num_detections has shape [batch_size, 1].

However, BatchedNMSPlugin::getOutputDimensions() reports a shape of [batch_size].

The following is a diff of the trtexec log (actual vs. expected):

-[V] [TRT] node_of_num_detections [BatchedNMS_TRT] outputs: [num_detections -> (16)[INT32]], [nmsed_boxes -> (16, 50, 4)[FLOAT]], [nmsed_scores -> (16, 50)[FLOAT]], [nmsed_classes -> (16, 50)[FLOAT]], 
+[V] [TRT] node_of_num_detections [BatchedNMS_TRT] outputs: [num_detections -> (16, 1)[INT32]], [nmsed_boxes -> (16, 50, 4)[FLOAT]], [nmsed_scores -> (16, 50)[FLOAT]], [nmsed_classes -> (16, 50)[FLOAT]], 

Here batch_size=16 and keepTopK=50, so num_detections is expected to have shape (16, 1), but its actual shape is (16).

It seems BatchedNMSPlugin::getOutputDimensions() needs the following change.

         // num_detections
         if (index == 0)
         {
             Dims dim0{};
-            dim0.nbDims = 0;
+            dim0.nbDims = 1;
+            dim0.d[0] = 1;
             return dim0;
         }

Environment

TensorRT Version: 8.0.1
NVIDIA GPU: Jetson AGX Xavier Developer Kit
NVIDIA Driver Version: N/A
CUDA Version: 10.2
CUDNN Version: 8.2.1
Operating System: L4T 32.6.1
Python Version (if applicable): 3.6.9
Tensorflow Version (if applicable): 2.6.2+nv21.12
PyTorch Version (if applicable): 1.8.1
Baremetal or Container (if so, version): Container l4t-base r32.6.1

Relevant Files

No external links; all necessary scripts and logs are shown in the next section (Steps To Reproduce).

Steps To Reproduce

  1. Run the following Python script to generate a minimal reproducible ONNX file.
# generate minimal_nms.onnx in current directory.
$ python3 generate_minimal_reproduce_model.py
generate_minimal_reproduce_model.py
```python
import onnx
import onnx.numpy_helper
import numpy as np

inputs = [
    onnx.helper.make_tensor_value_info('boxes', onnx.TensorProto.FLOAT, [16, 100, 1, 4]),
    onnx.helper.make_tensor_value_info('scores', onnx.TensorProto.FLOAT, [16, 100, 20]),
]
outputs = [
    onnx.helper.make_tensor_value_info('result_detections', onnx.TensorProto.INT64, [16, 1])
]

# BatchedNMS_TRT -> Add
nodes = [
    onnx.helper.make_node(
        'BatchedNMS_TRT',
        ['boxes', 'scores'],
        ['num_detections', 'nmsed_boxes', 'nmsed_scores', 'nmsed_classes'],
        shareLocation=1,
        backgroundLabelId=-1,
        numClasses=20,
        topK=100,
        keepTopK=50,
        scoreThreshold=0.5,
        iouThreshold=0.9,
        isNormalized=0,
        clipBoxes=0,
        domain='tensorrt'),
    # expected shape is Add([16, 1], [16, 1]) -> [16, 1]
    onnx.helper.make_node('Add', ['num_detections', 'dummy_coeff'], ['result_detections']),
]
inits = [
    onnx.numpy_helper.from_array(np.zeros([16, 1], dtype=np.int64), 'dummy_coeff'),
]
opsets = [
    onnx.helper.make_opsetid('', 11),
    onnx.helper.make_opsetid('tensorrt', 11),
]

model = onnx.helper.make_model(
    onnx.helper.make_graph(nodes, 'minimal_nms', inputs, outputs, inits),
    opset_imports=opsets,
    ir_version=7)
onnx.checker.check_model(model)
onnx.save(model, 'minimal_nms.onnx')
```
  2. Run trtexec to convert it to an engine file.
$ /usr/src/tensorrt/bin/trtexec --onnx=minimal_nms.onnx --workspace=64 --saveEngine=wrongly_shape.plan --buildOnly --verbose

The following is the trtexec verbose log (note the wrong num_detections shape).

trtexec verbose log ``` nvidia@jetson-agx-xavier:/work$ /usr/src/tensorrt/bin/trtexec --onnx=minimal_nms.onnx --workspace=64 --saveEngine=wrongly_shape.plan --buildOnly --verbose &&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=minimal_nms.onnx --workspace=64 --saveEngine=wrongly_shape.plan --buildOnly --verbose [03/25/2022-13:44:37] [I] === Model Options === [03/25/2022-13:44:37] [I] Format: ONNX [03/25/2022-13:44:37] [I] Model: minimal_nms.onnx [03/25/2022-13:44:37] [I] Output: [03/25/2022-13:44:37] [I] === Build Options === [03/25/2022-13:44:37] [I] Max batch: explicit [03/25/2022-13:44:37] [I] Workspace: 64 MiB [03/25/2022-13:44:37] [I] minTiming: 1 [03/25/2022-13:44:37] [I] avgTiming: 8 [03/25/2022-13:44:37] [I] Precision: FP32 [03/25/2022-13:44:37] [I] Calibration: [03/25/2022-13:44:37] [I] Refit: Disabled [03/25/2022-13:44:37] [I] Sparsity: Disabled [03/25/2022-13:44:37] [I] Safe mode: Disabled [03/25/2022-13:44:37] [I] Restricted mode: Disabled [03/25/2022-13:44:37] [I] Save engine: wrongly_shape.plan [03/25/2022-13:44:37] [I] Load engine: [03/25/2022-13:44:37] [I] NVTX verbosity: 0 [03/25/2022-13:44:37] [I] Tactic sources: Using default tactic sources [03/25/2022-13:44:37] [I] timingCacheMode: local [03/25/2022-13:44:37] [I] timingCacheFile: [03/25/2022-13:44:37] [I] Input(s)s format: fp32:CHW [03/25/2022-13:44:37] [I] Output(s)s format: fp32:CHW [03/25/2022-13:44:37] [I] Input build shapes: model [03/25/2022-13:44:37] [I] Input calibration shapes: model [03/25/2022-13:44:37] [I] === System Options === [03/25/2022-13:44:37] [I] Device: 0 [03/25/2022-13:44:37] [I] DLACore: [03/25/2022-13:44:37] [I] Plugins: [03/25/2022-13:44:37] [I] === Inference Options === [03/25/2022-13:44:37] [I] Batch: Explicit [03/25/2022-13:44:37] [I] Input inference shapes: model [03/25/2022-13:44:37] [I] Iterations: 10 [03/25/2022-13:44:37] [I] Duration: 3s (+ 200ms warm up) [03/25/2022-13:44:37] [I] Sleep time: 0ms [03/25/2022-13:44:37] [I] 
Streams: 1 [03/25/2022-13:44:37] [I] ExposeDMA: Disabled [03/25/2022-13:44:37] [I] Data transfers: Enabled [03/25/2022-13:44:37] [I] Spin-wait: Disabled [03/25/2022-13:44:37] [I] Multithreading: Disabled [03/25/2022-13:44:37] [I] CUDA Graph: Disabled [03/25/2022-13:44:37] [I] Separate profiling: Disabled [03/25/2022-13:44:37] [I] Time Deserialize: Disabled [03/25/2022-13:44:37] [I] Time Refit: Disabled [03/25/2022-13:44:37] [I] Skip inference: Enabled [03/25/2022-13:44:37] [I] Inputs: [03/25/2022-13:44:37] [I] === Reporting Options === [03/25/2022-13:44:37] [I] Verbose: Enabled [03/25/2022-13:44:37] [I] Averages: 10 inferences [03/25/2022-13:44:37] [I] Percentile: 99 [03/25/2022-13:44:37] [I] Dump refittable layers:Disabled [03/25/2022-13:44:37] [I] Dump output: Disabled [03/25/2022-13:44:37] [I] Profile: Disabled [03/25/2022-13:44:37] [I] Export timing to JSON file: [03/25/2022-13:44:37] [I] Export output to JSON file: [03/25/2022-13:44:37] [I] Export profile to JSON file: [03/25/2022-13:44:37] [I] [03/25/2022-13:44:37] [I] === Device Information === [03/25/2022-13:44:37] [I] Selected Device: Xavier [03/25/2022-13:44:37] [I] Compute Capability: 7.2 [03/25/2022-13:44:37] [I] SMs: 8 [03/25/2022-13:44:37] [I] Compute Clock Rate: 1.377 GHz [03/25/2022-13:44:37] [I] Device Global Memory: 15824 MiB [03/25/2022-13:44:37] [I] Shared Memory per SM: 96 KiB [03/25/2022-13:44:37] [I] Memory Bus Width: 256 bits (ECC disabled) [03/25/2022-13:44:37] [I] Memory Clock Rate: 1.377 GHz [03/25/2022-13:44:37] [I] [03/25/2022-13:44:37] [I] TensorRT version: 8001 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Region_TRT version 1 
[03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::ScatterND version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::CropAndResize version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Proposal version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::Split version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1 [03/25/2022-13:44:37] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1 [03/25/2022-13:44:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 2491 (MiB) [03/25/2022-13:44:40] [I] Start parsing network model [03/25/2022-13:44:40] [I] [TRT] 
---------------------------------------------------------------- [03/25/2022-13:44:40] [I] [TRT] Input filename: minimal_nms.onnx [03/25/2022-13:44:40] [I] [TRT] ONNX IR version: 0.0.7 [03/25/2022-13:44:40] [I] [TRT] Opset version: 11 [03/25/2022-13:44:40] [I] [TRT] Producer name: [03/25/2022-13:44:40] [I] [TRT] Producer version: [03/25/2022-13:44:40] [I] [TRT] Domain: [03/25/2022-13:44:40] [I] [TRT] Model version: 0 [03/25/2022-13:44:40] [I] [TRT] Doc string: [03/25/2022-13:44:40] [I] [TRT] ---------------------------------------------------------------- [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::GridAnchor_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::GridAnchorRect_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::NMS_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Reorg_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Region_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Clip_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::LReLU_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::PriorBox_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Normalize_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::ScatterND version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::RPROI_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::BatchedNMS_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::FlattenConcat_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::CropAndResize version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin 
creator already registered - ::DetectionLayer_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::EfficientNMS_ONNX_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::EfficientNMS_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Proposal version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::ProposalLayer_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::PyramidROIAlign_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::ResizeNearest_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::Split version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::SpecialSlice_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Plugin creator already registered - ::InstanceNormalization_TRT version 1 [03/25/2022-13:44:40] [V] [TRT] Adding network input: boxes with dtype: float32, dimensions: (16, 100, 1, 4) [03/25/2022-13:44:40] [V] [TRT] Registering tensor: boxes for ONNX tensor: boxes [03/25/2022-13:44:40] [V] [TRT] Adding network input: scores with dtype: float32, dimensions: (16, 100, 20) [03/25/2022-13:44:40] [V] [TRT] Registering tensor: scores for ONNX tensor: scores [03/25/2022-13:44:40] [V] [TRT] Importing initializer: dummy_coeff [03/25/2022-13:44:40] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. 
[03/25/2022-13:44:40] [V] [TRT] Parsing node: node_of_num_detections [BatchedNMS_TRT] [03/25/2022-13:44:40] [V] [TRT] Searching for input: boxes [03/25/2022-13:44:40] [V] [TRT] Searching for input: scores [03/25/2022-13:44:40] [V] [TRT] node_of_num_detections [BatchedNMS_TRT] inputs: [boxes -> (16, 100, 1, 4)[FLOAT]], [scores -> (16, 100, 20)[FLOAT]], [03/25/2022-13:44:40] [I] [TRT] No importer registered for op: BatchedNMS_TRT. Attempting to import as plugin. [03/25/2022-13:44:40] [I] [TRT] Searching for plugin: BatchedNMS_TRT, plugin_version: 1, plugin_namespace: [03/25/2022-13:44:40] [W] [TRT] builtin_op_importers.cpp:4552: Attribute scoreBits not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build. [03/25/2022-13:44:40] [I] [TRT] Successfully created plugin: BatchedNMS_TRT [03/25/2022-13:44:40] [V] [TRT] Registering layer: node_of_num_detections for ONNX node: node_of_num_detections [03/25/2022-13:44:40] [V] [TRT] Registering tensor: num_detections for ONNX tensor: num_detections [03/25/2022-13:44:40] [V] [TRT] Registering tensor: nmsed_boxes for ONNX tensor: nmsed_boxes [03/25/2022-13:44:40] [V] [TRT] Registering tensor: nmsed_scores for ONNX tensor: nmsed_scores [03/25/2022-13:44:40] [V] [TRT] Registering tensor: nmsed_classes for ONNX tensor: nmsed_classes [03/25/2022-13:44:40] [V] [TRT] node_of_num_detections [BatchedNMS_TRT] outputs: [num_detections -> (16)[INT32]], [nmsed_boxes -> (16, 50, 4)[FLOAT]], [nmsed_scores -> (16, 50)[FLOAT]], [nmsed_classes -> (16, 50)[FLOAT]], [03/25/2022-13:44:40] [V] [TRT] Parsing node: node_of_result_detections [Add] [03/25/2022-13:44:40] [V] [TRT] Searching for input: num_detections [03/25/2022-13:44:40] [V] [TRT] Searching for input: dummy_coeff [03/25/2022-13:44:40] [V] [TRT] node_of_result_detections [Add] inputs: [num_detections -> (16)[INT32]], [dummy_coeff -> (16, 1)[INT32]], [03/25/2022-13:44:40] [V] [TRT] Registering layer: dummy_coeff for ONNX node: 
dummy_coeff [03/25/2022-13:44:40] [V] [TRT] Registering layer: node_of_result_detections for ONNX node: node_of_result_detections [03/25/2022-13:44:40] [V] [TRT] Registering tensor: result_detections_0 for ONNX tensor: result_detections [03/25/2022-13:44:40] [V] [TRT] node_of_result_detections [Add] outputs: [result_detections -> (16, 16)[INT32]], [03/25/2022-13:44:40] [V] [TRT] Marking result_detections_0 as output: result_detections [03/25/2022-13:44:40] [I] Finish parsing network model [03/25/2022-13:44:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 372, GPU 2491 (MiB) [03/25/2022-13:44:40] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 372 MiB, GPU 2491 MiB [03/25/2022-13:44:40] [V] [TRT] Applying generic optimizations to the graph for inference. [03/25/2022-13:44:40] [V] [TRT] Original: 4 layers [03/25/2022-13:44:40] [V] [TRT] After dead-layer removal: 4 layers [03/25/2022-13:44:40] [V] [TRT] After Myelin optimization: 4 layers [03/25/2022-13:44:40] [V] [TRT] After scale fusion: 4 layers [03/25/2022-13:44:40] [V] [TRT] After vertical fusions: 4 layers [03/25/2022-13:44:40] [V] [TRT] After dupe layer removal: 4 layers [03/25/2022-13:44:40] [V] [TRT] After final dead-layer removal: 4 layers [03/25/2022-13:44:40] [V] [TRT] After tensor merging: 4 layers [03/25/2022-13:44:40] [V] [TRT] After concat removal: 4 layers [03/25/2022-13:44:40] [V] [TRT] Graph construction and optimization completed in 0.00391566 seconds. 
[03/25/2022-13:44:40] [I] [TRT] ---------- Layers Running on DLA ---------- [03/25/2022-13:44:40] [I] [TRT] ---------- Layers Running on GPU ---------- [03/25/2022-13:44:40] [I] [TRT] [GpuLayer] node_of_num_detections [03/25/2022-13:44:40] [I] [TRT] [GpuLayer] (Unnamed Layer* 1) [Shuffle] [03/25/2022-13:44:40] [I] [TRT] [GpuLayer] dummy_coeff [03/25/2022-13:44:40] [I] [TRT] [GpuLayer] node_of_result_detections [03/25/2022-13:44:41] [V] [TRT] Using cublas a tactic source [03/25/2022-13:44:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +376, now: CPU 598, GPU 2867 (MiB) [03/25/2022-13:44:41] [V] [TRT] Using cuDNN as a tactic source [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +510, now: CPU 905, GPU 3377 (MiB) [03/25/2022-13:44:44] [W] [TRT] Detected invalid timing cache, setup a local cache instead [03/25/2022-13:44:44] [V] [TRT] Constructing optimization profile number 0 [1/1]. [03/25/2022-13:44:44] [V] [TRT] *************** Autotuning format combination: Float(400,4,4,1), Float(2000,20,1) -> Int32(1), Float(200,4,1), Float(50,1), Float(50,1) *************** [03/25/2022-13:44:44] [V] [TRT] *************** Autotuning format combination: Int32(1) -> Int32(16,1) *************** [03/25/2022-13:44:44] [V] [TRT] --------------- Timing Runner: (Unnamed Layer* 1) [Shuffle] (Shuffle) [03/25/2022-13:44:44] [V] [TRT] Tactic: 0 Time: 0.0146 [03/25/2022-13:44:44] [V] [TRT] Tactic: 1 Time: 0.03174 [03/25/2022-13:44:44] [V] [TRT] Fastest Tactic: 0 Time: 0.0146 [03/25/2022-13:44:44] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: Shuffle Tactic: 0 [03/25/2022-13:44:44] [V] [TRT] *************** Autotuning format combination: -> Int32(1,1) *************** [03/25/2022-13:44:44] [V] [TRT] *************** Autotuning format combination: Int32(16,1), Int32(1,1) -> Int32(16,1) *************** [03/25/2022-13:44:44] [V] [TRT] --------------- Timing Runner: node_of_result_detections (ElementWise) [03/25/2022-13:44:44] [V] [TRT] Tactic: 1 is the 
only option, timing skipped [03/25/2022-13:44:44] [V] [TRT] Fastest Tactic: 1 Time: 0 [03/25/2022-13:44:44] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: ElementWise Tactic: 1 [03/25/2022-13:44:44] [V] [TRT] Formats and tactics selection completed in 0.0316925 seconds. [03/25/2022-13:44:44] [V] [TRT] After reformat layers: 4 layers [03/25/2022-13:44:44] [V] [TRT] Block size 67108864 [03/25/2022-13:44:44] [V] [TRT] Block size 12800 [03/25/2022-13:44:44] [V] [TRT] Block size 3584 [03/25/2022-13:44:44] [V] [TRT] Block size 3584 [03/25/2022-13:44:44] [V] [TRT] Block size 512 [03/25/2022-13:44:44] [V] [TRT] Block size 1 [03/25/2022-13:44:44] [V] [TRT] Total Activation Memory: 67129345 [03/25/2022-13:44:44] [I] [TRT] Detected 2 inputs and 1 output network tensors. [03/25/2022-13:44:44] [V] [TRT] Layer: node_of_num_detections HostPersistent: 48 DevicePersistent: 0 [03/25/2022-13:44:44] [V] [TRT] Layer: dummy_coeff HostPersistent: 0 DevicePersistent: 0 [03/25/2022-13:44:44] [V] [TRT] Layer: node_of_result_detections HostPersistent: 0 DevicePersistent: 0 [03/25/2022-13:44:44] [I] [TRT] Total Host Persistent Memory: 48 [03/25/2022-13:44:44] [I] [TRT] Total Device Persistent Memory: 0 [03/25/2022-13:44:44] [I] [TRT] Total Scratch Memory: 67840 [03/25/2022-13:44:44] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB [03/25/2022-13:44:44] [V] [TRT] Using cublas a tactic source [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 929, GPU 3431 (MiB) [03/25/2022-13:44:44] [V] [TRT] Using cuDNN as a tactic source [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 929, GPU 3439 (MiB) [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 929, GPU 3439 (MiB) [03/25/2022-13:44:44] [V] [TRT] Engine generation completed in 4.44552 seconds. 
[03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 929, GPU 3439 (MiB) [03/25/2022-13:44:44] [V] [TRT] Engine Layer Information: Layer(PluginV2): node_of_num_detections, Tactic: 0, boxes[Float(16,100,1,4)], scores[Float(16,100,20)] -> num_detections[Int32(16)], nmsed_boxes[Float(16,50,4)], nmsed_scores[Float(16,50)], nmsed_classes[Float(16,50)] Layer(Constant): dummy_coeff, Tactic: 0, -> (Unnamed Layer* 2) [Constant]_output[Int32(16,1)] Layer(ElementWise): node_of_result_detections, Tactic: 1, (Unnamed Layer* 1) [Shuffle]_output[Int32(1,16)], (Unnamed Layer* 2) [Constant]_output[Int32(16,1)] -> result_detections[Int32(16,16)] [03/25/2022-13:44:44] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 929 MiB, GPU 3439 MiB [03/25/2022-13:44:44] [I] [TRT] Loaded engine size: 0 MB [03/25/2022-13:44:44] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 929 MiB, GPU 3440 MiB [03/25/2022-13:44:44] [V] [TRT] Using cublas a tactic source [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 929, GPU 3440 (MiB) [03/25/2022-13:44:44] [V] [TRT] Using cuDNN as a tactic source [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 929, GPU 3440 (MiB) [03/25/2022-13:44:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 929, GPU 3440 (MiB) [03/25/2022-13:44:44] [V] [TRT] Deserialization required 15882 microseconds. [03/25/2022-13:44:44] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 929 MiB, GPU 3440 MiB [03/25/2022-13:44:44] [I] Engine built in 7.24209 sec. &&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=minimal_nms.onnx --workspace=64 --saveEngine=wrongly_shape.plan --buildOnly --verbose ```
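
The log above shows the Add node producing result_detections with shape (16, 16) rather than (16, 1). A minimal sketch of why, assuming the Add follows NumPy-style (ONNX multidirectional) broadcasting; `broadcast_shape` is a hypothetical helper written for illustration, not a TensorRT or ONNX function:

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the result shape of NumPy-style broadcasting of shapes a and b."""
    result = []
    # Align the shapes from the trailing dimension; a missing leading
    # dimension is treated as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(max(x, y))
    return tuple(reversed(result))


# num_detections as reported by the plugin, added to dummy_coeff
actual = broadcast_shape((16,), (16, 1))
# num_detections with the shape this issue expects, added to dummy_coeff
expected = broadcast_shape((16, 1), (16, 1))

print(actual)    # (16, 16)
print(expected)  # (16, 1)
```

With a rank-1 num_detections, the 16 is aligned against the trailing 1 of dummy_coeff, so the result expands to 16 x 16 instead of staying at one value per batch item.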

wraveane commented 2 years ago

The BatchedNMSPlugin class is implemented to support implicit batch mode. In this case, all shapes are given without a batch size, and so the getOutputDimensions() function gives only the shape of a tensor from the second dimension onward.

When the plugin runs in TensorRT, batch size dimensions will be assigned to all shapes correctly and as expected, whether the network runs in implicit or explicit batch modes.
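
The implicit-batch convention described here can be sketched in plain Python. This is only an illustration: `full_shape` is a hypothetical helper, not part of the TensorRT API, and it assumes the batch size is simply prepended to the per-sample dimensions the plugin reports.

```python
def full_shape(batch_size, per_sample_dims):
    """Prepend the batch dimension to the per-sample dims a plugin reports."""
    return (batch_size, *per_sample_dims)


batch_size = 16

# getOutputDimensions() with nbDims = 0 reports no per-sample dimensions.
current = full_shape(batch_size, ())
# With nbDims = 1 and d[0] = 1, one per-sample dimension of size 1 is reported.
proposed = full_shape(batch_size, (1,))
# nmsed_boxes reports (keepTopK, 4) per sample, with keepTopK = 50.
nmsed_boxes = full_shape(batch_size, (50, 4))

print(current)      # (16,)
print(proposed)     # (16, 1)
print(nmsed_boxes)  # (16, 50, 4)
```

Under this assumption, a zero-dimensional per-sample shape yields [batch_size], while a per-sample shape of [1] yields [batch_size, 1].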

maminus commented 2 years ago

The expected shape of num_detections is

[batch_size, 1]
             ^

So the second dimension is 1.

But getOutputDimensions() does not report a second dimension.

         // num_detections
         if (index == 0)
         {
             Dims dim0{};
             dim0.nbDims = 0;  // does **not** report a second dimension now,
                               // but we want a second dimension of `1`
             return dim0;
         }

So the num_detections shape we get now is

[batch_size]
           ^

nvpohanh commented 2 years ago

num_detections

The num_detections output is of shape [batch_size]. It is an int32 tensor indicating the number of valid detections per batch item. It can be less than keepTopK. Only the top num_detections[i] entries in nmsed_boxes[i], nmsed_scores[i] and nmsed_classes[i] are valid.
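
As a rough sketch of how a consumer might apply this rule, the snippet below keeps only the first num_detections[i] scores for each batch item. The data and the `valid_detections` helper are made up for illustration (batch_size = 3, keepTopK = 3) and are not part of TensorRT.

```python
def valid_detections(num_detections, nmsed_scores):
    """Keep only the first num_detections[i] entries of each batch item."""
    return [scores[:n] for n, scores in zip(num_detections, nmsed_scores)]


# Hypothetical plugin outputs: one count per batch item, shape [batch_size].
num_detections = [2, 0, 1]
# Hypothetical scores, shape [batch_size, keepTopK]; entries past the count are padding.
nmsed_scores = [
    [0.9, 0.8, 0.0],
    [0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0],
]

valid = valid_detections(num_detections, nmsed_scores)
print(valid)  # [[0.9, 0.8], [], [0.7]]
```

The same slicing would apply to nmsed_boxes[i] and nmsed_classes[i].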

The doc has been updated: https://github.com/NVIDIA/TensorRT/tree/main/plugin/batchedNMSPlugin#structure

Closing this for now. Please reopen if the issue still exists. Thanks