Closed: laggui closed this issue 1 year ago.
I think it's not easy to work around this. We plan to deprecate the EfficientNMS plugins and use INMSLayer instead. From the API doc:
There is a hardware-dependent limit K such that only the K highest scoring boxes in each batch item will be considered for selection. The value of K is 2000 for SM 5.3 and 6.2 devices, and 5000 otherwise.
From the code, it seems the limit comes from registers per block. @samurdhikaru @ttyio do you have any suggestions? Thanks!
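The rule quoted from the API doc can be expressed as a small helper. This is just an illustration of the documented behavior; the function name and structure are my own, not part of any TensorRT API:

```python
def max_considered_boxes(sm_major: int, sm_minor: int) -> int:
    """Return the documented hardware-dependent limit K on the number of
    boxes considered for selection per batch item.

    Per the INMSLayer docs: K is 2000 for SM 5.3 (Jetson TX1) and
    SM 6.2 (Jetson TX2) devices, and 5000 otherwise.
    """
    if (sm_major, sm_minor) in {(5, 3), (6, 2)}:
        return 2000
    return 5000

print(max_considered_boxes(7, 5))  # Titan RTX (SM 7.5) -> 5000
print(max_considered_boxes(6, 2))  # Jetson TX2 -> 2000
```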
@zerollzeng Thanks for the reference. I see that the limit also applies to the INMSLayer.
Regardless of the per-hardware limit, I am using the TensorRT EP with ONNX Runtime for my application, which makes the plugins easy to use/insert. I'd have to check the layer API documentation if I want to migrate to INMSLayer in the future, but if you have any pointers for usage with ONNX models, let me know :)
Any idea how long until the NMS plugins are deprecated?
I think you can migrate to INMSLayer now; then you will get new optimizations/enhancements immediately once we have them.
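For anyone evaluating the migration, INMSLayer performs standard greedy IoU-based suppression. Below is a minimal pure-Python sketch of those semantics for reference only; it is not the TensorRT API, and the threshold/box-format choices are my own assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5, max_output_boxes=5000):
    """Greedy NMS: visit boxes in descending score order, keep a box only
    if it does not overlap an already-kept box above the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) >= max_output_boxes:
            break
    return keep

# Two heavily overlapping boxes plus one distant box: the lower-scoring
# overlap is suppressed, the distant box survives.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```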
@samurdhikaru @ttyio just following up to see if either of you has any suggestions.
Thanks :)
Sorry, this is a documented limitation and I do not know of a workaround ;-(
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_n_m_s_layer.html
Closing since this is a known limitation. Thanks!
Description
I have an object detection model I exported to ONNX with the EfficientNMS_TRT node. I use ONNX Runtime w/ the TensorRT EP.
Today I realized the maximum number of detected boxes with the EfficientNMS plugin seems to be capped at 5000 here, even if we set the max_output_boxes value higher. Is there any way around this limitation? I also saw that on the Jetson TX1/TX2 this seems to be capped at 2000.
This limitation does not seem to be explicitly mentioned anywhere in the plugin documentation. The only limitations addressed are w.r.t. batch_size * max_output_boxes_per_class * num_classes, but in my use case there is only one class anyway. FWIW, when using the standard NonMaxSuppression operator with CUDA for the same model, I am able to detect almost 7900 objects in the image. But with the EfficientNMS plugin it caps out at 5000 detections.
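The observed numbers are consistent with the output count being clamped by the hardware-dependent limit K as well as the requested maximum. A hypothetical illustration (the function and its defaults are my own, not a TensorRT API):

```python
def effective_detections(num_candidates, max_output_boxes, hardware_k=5000):
    """Number of detections actually produced: the candidate count is
    clamped by both the requested maximum and the hardware limit K."""
    return min(num_candidates, max_output_boxes, hardware_k)

# ~7900 candidate objects with max_output_boxes set to 10000:
print(effective_detections(7900, 10000))        # -> 5000 (most GPUs)
print(effective_detections(7900, 10000, 2000))  # -> 2000 (Jetson TX1/TX2)
```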
Environment
TensorRT Version: 8.5.3.1
NVIDIA GPU: Titan RTX
NVIDIA Driver Version: 525.105.17
CUDA Version: 11.6
CUDNN Version: 8.5.0.96
Operating System:
Python Version (if applicable): 3.10.9
Relevant Files
I cannot distribute the model at this time, but if required I could produce a MWE.
Steps To Reproduce
Same as above, but if required I could produce a MWE.