Have you been able to use the NMS Plugin for any networks from the TensorFlow 1 Model Zoo other than SSDLite MobileNetV2? If yes, how did you do it?
ssd_mobilenet_v2_300x300
and ssdlite_mobiledet_gpu_320x320
(I created this plugin for the purpose of running ssdlite_mobiledet_gpu on Jetson.)
If not, do you know what might be causing these input dimension errors?
For the FPN model, the input of TFLite_Detection_PostProcess is different.
ssd_mobilenet_v2_300x300:
ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco:
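To see the difference concretely, one could dump the shapes of the tensors feeding the post-process node in each GraphSurgeon ONNX file. A minimal sketch follows; the file names are placeholders, and matching the node by an op name containing "NMS" or "PostProcess" is an assumption about how the Add_TFLiteNMS_Plugin notebook names the node.

```python
# Sketch: compare the inputs of the NMS/post-process node in two
# GraphSurgeon-modified ONNX files. Adjust the op-name match and file
# names to your actual models; both are assumptions here.
import onnx
import onnx_graphsurgeon as gs

def print_nms_inputs(onnx_path):
    graph = gs.import_onnx(onnx.load(onnx_path))
    for node in graph.nodes:
        if "NMS" in node.op or "PostProcess" in node.op:
            print(onnx_path, node.op)
            for inp in node.inputs:
                print("  ", inp.name, inp.shape)

print_nms_inputs("ssd_mobilenet_v2_300x300_gs.onnx")
print_nms_inputs("ssd_mobilenet_v1_fpn_640x640_gs.onnx")  # FPN model: dimensions differ
```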
Therefore, if you delete the ASSERT in tfliteNMSPlugin.cpp, I think it will work for the time being.
It seems this cannot be fixed just by deleting the ASSERT. I am checking the details.
I fixed a problem with the FPN model and committed the change.
However, I could not convert the FPN model on JetPack 4.5.1, so I could not confirm the fix there.
ssd_mobilenet_v1_fpn_shared_box_predictor_640x640 could not be converted due to lack of memory on the Jetson Nano, and ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320 results in an error at other layers. In TensorRT 8.2 (JetPack 4.6), the problem with ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320 was solved and conversion became possible. For this reason, I plan to proceed with confirmation on JetPack 4.6.1 and will share the results once I can confirm it.
I have confirmed that ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320 works with JetPack 4.6.1. Thank you very much for reporting the problem.
Build the plugin.
# Jetson Nano
git clone https://github.com/NobuoTsukamoto/tensorrt-examples
cd ./tensorrt-examples/python/detection
git submodule update --init --recursive
export TRT_LIBPATH=`pwd`/TensorRT
export PATH=${PATH}:/usr/local/cuda/bin
cd $TRT_LIBPATH
mkdir -p build && cd build   # out-of-source build directory (assumed; implied by the "cmake .." below)
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=10.2 -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/usr/bin/gcc
make -j$(nproc)
sudo cp out/libnvinfer_plugin.so.8.2.0 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8.2.0
sudo rm /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8
sudo ln -s /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8.2.0 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8
ls -al /usr/lib/aarch64-linux-gnu/libnvinfer_plugin*
lrwxrwxrwx 1 root root 49 Mar 18 20:47 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so -> /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8
lrwxrwxrwx 1 root root 53 Mar 8 18:29 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8 -> /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8.2.0
-rw-r--r-- 1 root root 18280576 Jun 26 2021 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8.0.1
-rwxr-xr-x 1 root root 41492816 Mar 18 21:12 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8.2.0
-rw-r--r-- 1 root root 21018654 Jun 26 2021 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin_static.a
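After swapping the library, a quick way to confirm that the custom TFLite NMS creator is actually visible to TensorRT is to list the registered plugin creators from Python. This is a sketch; the exact creator name comes from the plugin source, so check the printed list rather than hard-coding it.

```python
# Sketch: list plugin creators after replacing libnvinfer_plugin.
# The rebuilt library must be the one on the default linker path.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")           # loads/registers all plugin creators
for creator in trt.get_plugin_registry().plugin_creator_list:
    print(creator.name, creator.plugin_version)   # the custom TFLite NMS creator should appear here
```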
Check with trtexec. (Conversion and trtexec must be run in CUI (console) mode.)
# Jetson Nano
sudo systemctl set-default multi-user.target
sudo reboot
/usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/tensorrt-examples/models/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_gs.onnx --workspace=2048
...
[03/19/2022-12:44:00] [I] Starting inference
[03/19/2022-12:44:03] [I] Warmup completed 3 queries over 200 ms
[03/19/2022-12:44:03] [I] Timing trace has 39 queries over 3.12049 s
[03/19/2022-12:44:03] [I]
[03/19/2022-12:44:03] [I] === Trace details ===
[03/19/2022-12:44:03] [I] Trace averages of 10 runs:
[03/19/2022-12:44:03] [I] Average on 10 runs - GPU latency: 79.8587 ms - Host latency: 79.9861 ms (end to end 79.9958 ms, enqueue 12.687 ms)
[03/19/2022-12:44:03] [I] Average on 10 runs - GPU latency: 79.8463 ms - Host latency: 79.9742 ms (end to end 79.9842 ms, enqueue 12.0583 ms)
[03/19/2022-12:44:03] [I] Average on 10 runs - GPU latency: 79.847 ms - Host latency: 79.9747 ms (end to end 79.9847 ms, enqueue 12.6927 ms)
[03/19/2022-12:44:03] [I]
[03/19/2022-12:44:03] [I] === Performance summary ===
[03/19/2022-12:44:03] [I] Throughput: 12.498 qps
[03/19/2022-12:44:03] [I] Latency: min = 79.7407 ms, max = 80.3516 ms, mean = 80.0022 ms, median = 79.9524 ms, percentile(99%) = 80.3516 ms
[03/19/2022-12:44:03] [I] End-to-End Host Latency: min = 79.75 ms, max = 80.3616 ms, mean = 80.012 ms, median = 79.9622 ms, percentile(99%) = 80.3616 ms
[03/19/2022-12:44:03] [I] Enqueue Time: min = 9.36206 ms, max = 12.8073 ms, mean = 12.3673 ms, median = 12.7076 ms, percentile(99%) = 12.8073 ms
[03/19/2022-12:44:03] [I] H2D Latency: min = 0.119629 ms, max = 0.123169 ms, mean = 0.12132 ms, median = 0.121338 ms, percentile(99%) = 0.123169 ms
[03/19/2022-12:44:03] [I] GPU Compute Time: min = 79.6128 ms, max = 80.2219 ms, mean = 79.8744 ms, median = 79.8252 ms, percentile(99%) = 80.2219 ms
[03/19/2022-12:44:03] [I] D2H Latency: min = 0.00488281 ms, max = 0.00708008 ms, mean = 0.00644469 ms, median = 0.00640869 ms, percentile(99%) = 0.00708008 ms
[03/19/2022-12:44:03] [I] Total Host Walltime: 3.12049 s
[03/19/2022-12:44:03] [I] Total GPU Compute Time: 3.1151 s
[03/19/2022-12:44:03] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/19/2022-12:44:03] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/jetson/tensorrt-examples/models/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_gs.onnx --workspace=2048
[03/19/2022-12:44:03] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 879, GPU 3078 (MiB)
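For reference, a rough Python equivalent of the trtexec run above, built with the TensorRT 8.x Python API and an FP16 flag; this is a sketch, not the actual convert_onnxgs2trt.py script, and the paths are illustrative.

```python
# Sketch: build an FP16 engine from the GraphSurgeon ONNX model.
# Requires the rebuilt libnvinfer_plugin so the custom NMS plugin is found.
import tensorrt as trt

ONNX_PATH = "ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_gs.onnx"
ENGINE_PATH = "ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_fp16.trt"

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")   # register the custom NMS plugin

builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit(1)

config = builder.create_builder_config()
config.max_workspace_size = 2048 << 20    # matches --workspace=2048
config.set_flag(trt.BuilderFlag.FP16)     # FP16 engine, as used for the demo below

serialized = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized)
```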
Run the demo. (I created an FP16 model.)
# Jetson Nano
sudo systemctl set-default graphical.target
sudo reboot
python3 trt_detection.py --model ../../models/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_fp16.trt --label ../../models/coco_labels.txt --width 320 --height 320 --videopath __PATH_TO_VIDEOFILE
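For completeness, here is a minimal sketch of the inference side, assuming pycuda is available. trt_detection.py itself additionally handles preprocessing, drawing, and video I/O; the bindings below are queried from the engine rather than hard-coded, and the random input stands in for a preprocessed 320x320 frame.

```python
# Sketch: load the generated .trt engine and run one inference.
import numpy as np
import pycuda.autoinit        # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")
with open("../../models/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_fp16.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate host/device buffers for every binding reported by the engine.
host, device, bindings = {}, {}, []
for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host[name] = np.zeros(trt.volume(shape), dtype=dtype)
    device[name] = cuda.mem_alloc(host[name].nbytes)
    bindings.append(int(device[name]))
    print(i, name, shape, dtype)

# Fill the input binding (placeholder data here) and run inference.
input_name = engine.get_binding_name(0)
host[input_name][:] = np.random.rand(host[input_name].size).astype(host[input_name].dtype)
cuda.memcpy_htod(device[input_name], host[input_name])
context.execute_v2(bindings)

# Copy back the detection outputs (boxes, classes, scores, counts).
for name in host:
    if not engine.binding_is_input(engine.get_binding_index(name)):
        cuda.memcpy_dtoh(host[name], device[name])
        print(name, host[name][:8])
```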
I have built the project on JetPack 4.5.1 following the instructions from https://github.com/NobuoTsukamoto/tensorrt-examples/issues/3#issuecomment-1062732055.
Then I created the ONNX GraphSurgeon model for ssd_mobilenet_v1_fpn_coco using the Add_TFLiteNMS_Plugin notebook, replacing the "ssdlite_mobilenet_v2_coco_2018_05_09" model with "ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03" and changing the input shape from "input_shapes=1,300,300,3" to "input_shapes=1,640,640,3". Here is my TFLiteNMS_Plugin notebook.
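In other words, the substitutions boil down to two values (hypothetical variable names; the notebook's actual cells may be laid out differently):

```python
# Hypothetical parameter cell mirroring the substitutions described above.
MODEL_NAME = "ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03"  # was "ssdlite_mobilenet_v2_coco_2018_05_09"
INPUT_SHAPES = "1,640,640,3"                                                              # was "1,300,300,3"
```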
When trying to convert the model from ONNX to TensorRT on the Jetson Nano with convert_onnxgs2trt.py, I got the following error:
Looking deeper into tfliteNMSPlugin.cpp, this assertion comes from TFLiteNMSPlugin::getOutputDimensions, at line 74. Just to check, I tried printing the number of dimensions of the inputs from the input vector:
The result was:
So my questions are:
Here is my network ONNX GraphSurgeon file.
EDIT: I am also sharing an image of the network generated with Netron.