NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

EfficientNMS Plugin max numSelectedBoxes = 5000 #3000

Closed: laggui closed this issue 1 year ago

laggui commented 1 year ago

Description

I have an object detection model I exported to ONNX with the EfficientNMS_TRT node. I use ONNX Runtime w/ the TensorRT EP.
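For context, the session is created roughly like this (a minimal sketch; the model path, input name, and input shape are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Minimal sketch: model path and input name are placeholders. The
# EfficientNMS_TRT node is only supported when the TensorRT execution
# provider handles the graph; CUDA is listed as a fallback.
session = ort.InferenceSession(
    "model_with_efficient_nms.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {"images": dummy})
```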

Today I realized the maximum number of detected boxes with the EfficientNMS plugin appears to be capped at 5000 in the plugin source, even if the max_output_boxes attribute is set higher.
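For illustration, here is roughly how the node is emitted at export time (a sketch using onnx.helper; tensor names and threshold values are placeholders, attribute names follow the EfficientNMS_TRT plugin documentation):

```python
from onnx import helper

# Sketch of the exported plugin node. Requesting max_output_boxes above
# 5000 has no visible effect: the plugin clamps the number of selected
# boxes internally.
nms_node = helper.make_node(
    "EfficientNMS_TRT",
    inputs=["boxes", "scores"],
    outputs=["num_detections", "detection_boxes",
             "detection_scores", "detection_classes"],
    max_output_boxes=8000,  # requested, but silently capped at 5000
    score_threshold=0.25,   # placeholder value
    iou_threshold=0.5,      # placeholder value
    background_class=-1,
    score_activation=0,
    box_coding=0,
    plugin_version="1",
)
```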

Is there any way around this limitation? I also saw that on the Jetson TX1/TX2 this seems to be maxed out at 2000.

This limitation does not seem to be explicitly mentioned anywhere in the plugin documentation. The only documented constraints are w.r.t. batch_size * max_output_boxes_per_class * num_classes, but in my use case there is only one class anyway.

FWIW, when using the standard NonMaxSuppression operator with CUDA for the same model, I am able to detect almost 7900 objects in the image. With the EfficientNMS plugin it caps out at 5000 detections.
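For reference, a sketch of the standard ONNX node that works here (built with onnx.helper; tensor names are placeholders):

```python
from onnx import helper

# Standard ONNX NonMaxSuppression node (opset 10+). Its
# max_output_boxes_per_class input is an int64 scalar tensor and is not
# subject to the plugin's 5000-box cap.
nms_node = helper.make_node(
    "NonMaxSuppression",
    inputs=[
        "boxes",                       # (num_batches, num_boxes, 4)
        "scores",                      # (num_batches, num_classes, num_boxes)
        "max_output_boxes_per_class",  # int64 scalar, e.g. 8000
        "iou_threshold",               # float scalar
        "score_threshold",             # float scalar
    ],
    outputs=["selected_indices"],      # (num_selected, 3)
)
```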

Environment

TensorRT Version: 8.5.3.1

NVIDIA GPU: Titan RTX

NVIDIA Driver Version: 525.105.17

CUDA Version: 11.6

CUDNN Version: 8.5.0.96

Operating System:

Python Version (if applicable): 3.10.9

Relevant Files

I cannot distribute the model at this time, but if required I could produce an MWE.

Steps To Reproduce

Same as above, but if required I could produce an MWE.

zerollzeng commented 1 year ago

I think it's not easy to work around (WAR) this. We plan to deprecate the EfficientNMS plugins and use INMSLayer instead. From the API doc:

There is a hardware-dependent limit K such that only the K highest scoring boxes in each batch item will be considered for selection. The value of K is 2000 for SM 5.3 and 6.2 devices, and 5000 otherwise.

From the code it seems it's limited by the registers per block. @samurdhikaru @ttyio do you have any suggestions? Thanks!
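(For anyone checking which limit applies to their device, a quick sketch using pycuda; the SM values come from the INMSLayer doc quoted above:)

```python
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 -- creates a context on device 0

# Per the INMSLayer documentation: K = 2000 on SM 5.3 and SM 6.2 devices
# (e.g. Jetson TX1/TX2), and 5000 otherwise.
major, minor = cuda.Device(0).compute_capability()
k_limit = 2000 if (major, minor) in ((5, 3), (6, 2)) else 5000
print(f"SM {major}.{minor}: at most {k_limit} boxes considered per batch item")
```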

laggui commented 1 year ago

@zerollzeng Thanks for the reference. I see that the limit also applies to the INMSLayer.

Regardless of the per-hardware limit, I am using the TensorRT EP w/ onnxruntime for my application, which makes the plugins easy to use/insert. I'd have to check the documentation for the layer API if I want to migrate to INMSLayer in the future, but if you have any pointers for usage w/ ONNX models, let me know :)

Any idea how long until the NMS plugins are deprecated?

zerollzeng commented 1 year ago

I think you can migrate to INMSLayer now. Then you will get any new optimizations/enhancements immediately once we ship them.
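For a starting point, a minimal sketch of inserting the layer through the TensorRT Python network definition API (TensorRT >= 8.5; the tensor shapes and box counts are placeholder assumptions, so check the INMSLayer docs for the exact input formats):

```python
import numpy as np
import tensorrt as trt

# Minimal sketch: shapes and values are placeholders.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

boxes = network.add_input("boxes", trt.float32, (1, 8000, 4))
scores = network.add_input("scores", trt.float32, (1, 8000, 1))

# max_output_boxes_per_class is a 0-D int32 tensor input; the
# hardware-dependent TopK cap still applies on top of this value.
max_out = network.add_constant(
    trt.Dims([]), np.array(7900, dtype=np.int32)
).get_output(0)

nms = network.add_nms(boxes, scores, max_out)
nms.bounding_box_format = trt.BoundingBoxFormat.CORNER_PAIRS
```

Note that the layer returns selected indices plus a count rather than gathered detections, so a gather step is still needed afterwards.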

laggui commented 1 year ago

@samurdhikaru @ttyio just following-up to see if one of you might have any suggestions.

Thanks :)

ttyio commented 1 year ago

Sorry, this is a documented limitation and I do not know of a workaround ;-(

https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_n_m_s_layer.html

ttyio commented 1 year ago

Closing since this is a known limitation. Thanks!