NVIDIA-AI-IOT / yolo_deepstream

yolo model qat and deploy with deepstream&tensorrt
Apache License 2.0

There are some errors after adding "BatchedNMS_TRT" layer #3

Open larryhu76 opened 3 years ago

larryhu76 commented 3 years ago

Description: I got the YOLOv4 ONNX model (yolov4_1_3_608_608_static.onnx) from https://github.com/Tianxiaomo/pytorch-YOLOv4, then used the command "python .\onnx_add_nms_plugin.py -f .\yolov4_1_3_608_608_static.onnx -t 2000 -k 100" to add the "BatchedNMS_TRT" layer, and got a new model (yolov4_1_3_608_608_static.nms.onnx). But when I used the command "trtexec --onnx=yolov4_1_3_608_608_static.nms.onnx --explicitBatch --saveEngine=tensorRT-eng --workspace=4096" to convert the model, there were some errors. Here is the log:

[11/25/2020-12:10:31] [I] === Model Options ===
[11/25/2020-12:10:31] [I] Format: ONNX
[11/25/2020-12:10:31] [I] Model: yolov4_1_3_608_608_static.nms.onnx
[11/25/2020-12:10:31] [I] Output:
[11/25/2020-12:10:31] [I] === Build Options ===
[11/25/2020-12:10:31] [I] Max batch: explicit
[11/25/2020-12:10:31] [I] Workspace: 4096 MB
[11/25/2020-12:10:31] [I] minTiming: 1
[11/25/2020-12:10:31] [I] avgTiming: 8
[11/25/2020-12:10:31] [I] Precision: FP32
[11/25/2020-12:10:31] [I] Calibration:
[11/25/2020-12:10:31] [I] Safe mode: Disabled
[11/25/2020-12:10:31] [I] Save engine: tensorRT-eng
[11/25/2020-12:10:31] [I] Load engine:
[11/25/2020-12:10:31] [I] Builder Cache: Enabled
[11/25/2020-12:10:31] [I] NVTX verbosity: 0
[11/25/2020-12:10:31] [I] Inputs format: fp32:CHW
[11/25/2020-12:10:31] [I] Outputs format: fp32:CHW
[11/25/2020-12:10:31] [I] Input build shapes: model
[11/25/2020-12:10:31] [I] Input calibration shapes: model
[11/25/2020-12:10:31] [I] === System Options ===
[11/25/2020-12:10:31] [I] Device: 0
[11/25/2020-12:10:31] [I] DLACore:
[11/25/2020-12:10:31] [I] Plugins:
[11/25/2020-12:10:31] [I] === Inference Options ===
[11/25/2020-12:10:31] [I] Batch: Explicit
[11/25/2020-12:10:31] [I] Input inference shapes: model
[11/25/2020-12:10:31] [I] Iterations: 10
[11/25/2020-12:10:31] [I] Duration: 3s (+ 200ms warm up)
[11/25/2020-12:10:31] [I] Sleep time: 0ms
[11/25/2020-12:10:31] [I] Streams: 1
[11/25/2020-12:10:31] [I] ExposeDMA: Disabled
[11/25/2020-12:10:31] [I] Spin-wait: Disabled
[11/25/2020-12:10:31] [I] Multithreading: Disabled
[11/25/2020-12:10:31] [I] CUDA Graph: Disabled
[11/25/2020-12:10:31] [I] Skip inference: Disabled
[11/25/2020-12:10:31] [I] Inputs:
[11/25/2020-12:10:31] [I] === Reporting Options ===
[11/25/2020-12:10:31] [I] Verbose: Disabled
[11/25/2020-12:10:31] [I] Averages: 10 inferences
[11/25/2020-12:10:31] [I] Percentile: 99
[11/25/2020-12:10:31] [I] Dump output: Disabled
[11/25/2020-12:10:31] [I] Profile: Disabled
[11/25/2020-12:10:31] [I] Export timing to JSON file:
[11/25/2020-12:10:31] [I] Export output to JSON file:
[11/25/2020-12:10:31] [I] Export profile to JSON file:
[11/25/2020-12:10:31] [I]

Input filename: yolov4_1_3_608_608_static.nms.onnx
ONNX IR version: 0.0.7
Opset version: 11
Producer name:
Producer version:
Domain:
Model version: 0
Doc string:

[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [I] [TRT] ModelImporter.cpp:135: No importer registered for op: BatchedNMS_TRT. Attempting to import as plugin.
[11/25/2020-12:10:32] [I] [TRT] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMS_TRT, plugin_version: 1, plugin_namespace:
[11/25/2020-12:10:32] [I] [TRT] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMS_TRT
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] Layer (Unnamed Layer 1330) [PluginV2Ext] failed validation
[11/25/2020-12:10:32] [E] [TRT] Network validation failed.
[11/25/2020-12:10:33] [E] Engine creation failed
[11/25/2020-12:10:33] [E] Engine set up failed

Does anyone know why this happens?

Environment
TensorRT Version: TensorRT-7.1.3.4
GPU Type: 1080ti
CUDA Version: cuda_11.0.3_451.82_win10
Operating System: Windows 10
Python Version (if applicable): 3.7
PyTorch Version (if applicable): 1.8.0.dev20201118

david9ml commented 3 years ago

same problem here

PCH10507323 commented 3 years ago

Hi @larryhu76 @david9ml, have you found out how to solve the problem? I ran into the same problem. Thanks.

mchi-zg commented 3 years ago

Please build TRT OSS following the README in this project, and make sure your TRT OSS build includes the "IPluginV2DynamicExt"-related change in https://github.com/NVIDIA/TensorRT/commits/master/plugin/batchedNMSPlugin

fedyok8 commented 3 years ago

In the file onnx_add_nms_plugin.py, I changed line 41 from op="BatchedNMS_TRT" to op="BatchedNMSDynamic_TRT". Using the modified onnx_add_nms_plugin.py, I generated a new ONNX model and it worked successfully.
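
The one-line change described above can also be applied mechanically. The following is only a minimal sketch, assuming the script contains the literal text op="BatchedNMS_TRT" as quoted in this comment; the helper name patch_plugin_op is hypothetical and is not part of the repository.

```python
import re

# Plugin names taken from the comment above.
OLD_OP = "BatchedNMS_TRT"
NEW_OP = "BatchedNMSDynamic_TRT"

# Matches op="BatchedNMS_TRT" or op='BatchedNMS_TRT', with optional spaces around "=".
_OP_PATTERN = re.compile(r"\bop\s*=\s*([\"'])" + re.escape(OLD_OP) + r"\1")


def patch_plugin_op(source: str) -> str:
    """Return the script text with the static NMS plugin name swapped for the dynamic one.

    Hypothetical helper: it only rewrites the op=... argument and leaves
    every other occurrence of the plugin name (comments, logs) untouched.
    """
    return _OP_PATTERN.sub(lambda m: f"op={m.group(1)}{NEW_OP}{m.group(1)}", source)


if __name__ == "__main__":
    line = '        op="BatchedNMS_TRT",'
    print(patch_plugin_op(line))
```

Running the helper on already patched text leaves it unchanged, because the pattern requires the old name to be enclosed directly in quotes.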

jstumpin commented 3 years ago

Following @fedyok8's advice, I was able to convert yolov4_-1_3_320_320_dynamic.onnx.nms.onnx into yolov4.trt with the -fp16 argument like so:

[01/02/2021-10:01:04] [V] [TRT] Layer(PluginV2): (Unnamed Layer* 3362) [PluginV2DynamicExt], Tactic: 0, boxes[Float(6300,1,4)], confs[Float(6300,80)] -> num_detections[Int32(1)], nmsed_boxes[Float(50,4)], nmsed_scores[Float(50)], nmsed_classes[Float(50)]
[01/02/2021-10:01:05] [I] TRT Engine file saved to: ../../../data/yolov4.trt 4
[01/02/2021-10:01:05] [I] Loading or building yolo model done
[01/02/2021-10:01:05] [W] [TRT] TensorRT was linked against cuDNN 8.0.4 but loaded cuDNN 8.0.1
[01/02/2021-10:01:05] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[01/02/2021-10:01:05] [V] [TRT] Allocated persistent device memory of size 130175488
[01/02/2021-10:01:05] [V] [TRT] Allocated activation device memory of size 26244096
[01/02/2021-10:01:05] [V] [TRT] Assigning persistent memory blocks for various profiles
batch size: 1
Time consumed in preProcess: 248
Time consumed in model: 13
Time consumed in postProcess: 1
[01/02/2021-10:01:05] [I] Inference of yolo model done

Do note that these lines need to be added as suggested here:

IOptimizationProfile* profile = builder->createOptimizationProfile();
profile->setDimensions("input", OptProfileSelector::kMIN, Dims4(1, 3, 320, 320));
profile->setDimensions("input", OptProfileSelector::kOPT, Dims4(1, 3, 320, 320));
profile->setDimensions("input", OptProfileSelector::kMAX, Dims4(1, 3, 320, 320));
config->addOptimizationProfile(profile);

However, the detection count is zero when using the -demo argument like so:

[01/02/2021-10:03:00] [I] Building and running a GPU inference engine for Yolo
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::Proposal version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[01/02/2021-10:03:00] [V] [TRT] Registered plugin creator - ::Split version 1
[01/02/2021-10:03:02] [W] [TRT] TensorRT was linked against cuDNN 8.0.4 but loaded cuDNN 8.0.1
[01/02/2021-10:03:02] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[01/02/2021-10:03:02] [V] [TRT] Deserialize required 1916175 microseconds.
[01/02/2021-10:03:03] [I] TRT Engine loaded from: ../../../data/yolov4.trt
[01/02/2021-10:03:03] [I] Loading or building yolo model done
[01/02/2021-10:03:03] [W] [TRT] TensorRT was linked against cuDNN 8.0.4 but loaded cuDNN 8.0.1
[01/02/2021-10:03:03] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[01/02/2021-10:03:03] [V] [TRT] Allocated persistent device memory of size 130175488
[01/02/2021-10:03:03] [V] [TRT] Allocated activation device memory of size 26244096
[01/02/2021-10:03:03] [V] [TRT] Assigning persistent memory blocks for various profiles
batch size: 1
------------ Next Image! --------------
Number of detections: 0
[ 0 0 0 0 ] score: 0 class: 0

BTW, both darknet2onnx and pytorch2onnx lead to the same outcome. Any pointers folks?

juri-baumberger commented 3 years ago

I have the same issue: 0 detections when using the (patched) BatchedNMS_TRT or BatchedNMSDynamic_TRT. I can convert the model and create the engine, but all results are 0. If I omit the NMS layer and do NMS in, e.g., OpenCV without acceleration, it works fine. Has anyone got it running?

jstumpin commented 3 years ago

My apologies for raising a non-issue! I had mistakenly set topK and keepTopK to 1 and 5 respectively; I am now using 200 and 100 instead. As a consolation, here are my findings based on YOLOv4 with FP16 precision, 320x320 input size, a 0.3 score confidence threshold, and a 0.2 IoU threshold, averaged over 3000 images:

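| config. | wall-time in ms (σ) | CPU % | GPU % | CPU MB | GPU MB |
|---|---|---|---|---|---|
| cpu-gpu-cpu | 8.38367 (1.35245) | 14.3 | 29 | 1250 | 950 |
| gpu-gpu-cpu | 9.08967 (2.58372) | 10.6 | 34 | 1360 | 1004 |
| cpu-gpu-gpu | 7.38833 (1.14231) | 15.7 | 34 | 1250 | 950 |
| gpu-gpu-gpu | 7.70467 (2.86335) | 10.7 | 35 | 1360 | 1004 |

| specs. | |
|---|---|
| CPU | Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz |
| GPU | NVIDIA GeForce RTX 2070 |
| RAM | DDR4-2666 (1333 MHz) 16384MB |
| OS | Microsoft Windows 10 Professional 64-bit (Build 21286) |
| S/W | TensorRT v7.2.1.6, CUDA v11.0, CUDNN v8.0, OpenCV v4.4.0 |

where config. denotes the device used for each stage, in the order preprocess-infer-postprocess (for example, cpu-gpu-cpu means preprocessing on the CPU, inference on the GPU, and postprocessing on the CPU).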
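
To illustrate how the topK and keepTopK settings mentioned in this comment limit what NMS can return, here is a minimal single-class sketch in plain Python. It is an assumption-laden simplification, not the BatchedNMS plugin's implementation: the real plugin works per class and per batch, and the function and parameter names below (nms, top_k, keep_top_k) are hypothetical mirrors of the plugin settings.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold, iou_threshold, top_k, keep_top_k):
    """Simplified single-class NMS; returns indices of the kept boxes."""
    # 1. Drop boxes whose score does not exceed the confidence threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    # 2. Only the top_k highest-scoring candidates enter suppression.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    candidates = candidates[:top_k]
    # 3. Greedy suppression: keep a box only if it does not overlap
    #    an already kept box by more than the IoU threshold.
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    # 4. Cap the number of boxes returned.
    return kept[:keep_top_k]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores, 0.3, 0.2, top_k=200, keep_top_k=100))
    print(nms(boxes, scores, 0.3, 0.2, top_k=1, keep_top_k=5))
```

In this toy input, the first two boxes overlap heavily and the third is separate. With top_k=200 and keep_top_k=100 both distinct objects survive, while with top_k=1 only the single best-scoring candidate ever reaches suppression, so the second object is lost no matter how large keep_top_k is.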

spacewalk01 commented 3 years ago

> In the file onnx_add_nms_plugin.py, I changed line 41 from op="BatchedNMS_TRT" to op="BatchedNMSDynamic_TRT". Using the modified onnx_add_nms_plugin.py, I generated a new ONNX model and it worked successfully.

It works for me too! Thanks.

spacewalk01 commented 3 years ago

However, I got 0 detections. I tried the following function to preprocess the input frames, but it didn't produce any output, so I added resized.convertTo(flt_image, CV_32FC3, 1.f / 255.f), and there was still no output. Has anyone had the same issue?

void batchPreprocess(const samplesCommon::BufferManager& buffers, std::vector<cv::Mat> &frames) {

    float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer("input"));

    // Load input video
    std::vector<std::vector<cv::Mat>> input_channels;
    for (int b = 0; b < batchSize; ++b)
    {
        input_channels.push_back(std::vector<cv::Mat> {static_cast<size_t>(numChannels)});
    }

    for (int b = 0; b < batchSize; ++b)
    {
        cv::Mat rgb_img;
        cv::cvtColor(frames[b], rgb_img, cv::COLOR_BGR2RGB);

        auto size = cv::Size(inputW, inputH);
        cv::Mat resized;
        cv::resize(rgb_img, resized, size, 0, 0, cv::INTER_LINEAR);

        cv::split(resized, input_channels[b]);
    }
}

jstumpin commented 3 years ago

@batselem Yeah, I got the invisible-detection symptom because of my own tampering with the default settings. Other than that, the [0, 1] scaling as per your implementation should address such cases.
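
For reference, the [0, 1] scaling discussed here can be written out without OpenCV. The sketch below is only an illustration under stated assumptions: the image is already resized and in RGB order, it is given as nested Python lists of (r, g, b) integers in 0..255, and the engine expects a planar fp32 CHW buffer (the trtexec log earlier in the thread reports "Inputs format: fp32:CHW"). The function name hwc_to_chw_unit_range is hypothetical.

```python
def hwc_to_chw_unit_range(image):
    """Convert an HWC image of 0..255 integers to a flat planar CHW list of floats in [0, 1].

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    The result holds all R values first, then all G values, then all B values.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    channels = 3
    plane = height * width
    flat = [0.0] * (channels * plane)
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            for c in range(channels):
                # Channel-major offset: plane index, then row, then column.
                flat[c * plane + y * width + x] = pixel[c] / 255.0
    return flat


if __name__ == "__main__":
    # A 1x2 image: one pure red pixel followed by one pure green pixel.
    tiny = [[(255, 0, 0), (0, 255, 0)]]
    print(hwc_to_chw_unit_range(tiny))
```

Whatever layout is used, the converted values also have to be written into the buffer that is handed to the engine; scaling an intermediate image alone does not change what the network receives.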

zahidzqj commented 1 year ago

I attempted to load the engine with:

with open('v4.engine', "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine1 = runtime.deserialize_cuda_engine(f.read())

and got this error:

[TensorRT] ERROR: INVALID_ARGUMENT: getPluginCreator could not find plugin BatchedNMSDynamic_TRT version 1
[TensorRT] ERROR: safeDeserializationUtils.cpp (322) - Serialization Error in load: 0 (Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.

I have replaced '*.so'. How can I solve this problem?

jstumpin commented 1 year ago

> I attempted to load ‘engine’ by : with open(‘v4.engine’, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime: engine1 = runtime.deserialize_cuda_engine(f.read())
> error:
> [TensorRT] ERROR: INVALID_ARGUMENT: getPluginCreator could not find plugin BatchedNMSDynamic_TRT version 1
> [TensorRT] ERROR: safeDeserializationUtils.cpp (322) - Serialization Error in load: 0 (Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
> [TensorRT] ERROR: INVALID_STATE: std::exception
> [TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
>
> I have replaced '*.so'. How to solve this problem?

I'd like to recommend this repo instead (more YOLO variants, same BatchedNMS_TRT support; DeepStream is unnecessary, can be safely omitted from the code): https://github.com/marcoslucianops/DeepStream-Yolo