NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Faster RCNN on TensorRT from TLT #2365

Closed codesian closed 1 year ago

codesian commented 2 years ago

Hi!

We trained a Faster RCNN model with a ResNet18 backbone in the TLT 3.0 container; the train and evaluate tests pass, and inference works perfectly with INT8 calibration. Here is the TLT config:

faster_rcnn_config.txt

We exported the model to an .etlt file, naming our output tensor NMS via the -o option. After that we converted the model with the tlt-converter tool on a Jetson NX, with the calibration options and sizes; that step seems OK. We are using TensorRT in a C++ environment for inference. The tensor inputs/outputs are:

0 - input_image: 3 x 1080 x 1920
1 - NMS: 1 x 100 x 7
2 - NMS_1: 1 x 1 x 1
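
For reference, here is how we dump the bindings to confirm these shapes (a sketch against the TensorRT 7 C++ API; engine is our deserialized nvinfer1::ICudaEngine*):

#include <iostream>
#include "NvInfer.h"

void printBindings(const nvinfer1::ICudaEngine* engine)
{
    for (int b = 0; b < engine->getNbBindings(); ++b)
    {
        const nvinfer1::Dims dims = engine->getBindingDimensions(b);
        std::cout << b << " - " << engine->getBindingName(b)
                  << (engine->bindingIsInput(b) ? " (input):" : " (output):");
        for (int d = 0; d < dims.nbDims; ++d)
            std::cout << " " << dims.d[d];
        std::cout << "\n";
    }
}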

At this point we get fewer detections than with inference in the TLT container, and we think the problem is the image preprocessing before feeding the input tensor. We have an OpenCV Mat as the RGB source image (image).

As a result we get only 20-30% of the detections compared with the TLT tests. As you can see below, we reverse the RGB order to BGR, subtract the per-channel mean, and divide by 1.0, as the TLT documentation specifies for the input_image_config parameter.

float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("input_image"));
float pixelMean[3]{103.939F, 116.779F, 123.68F};  // B, G, R channel means

// HWC interleaved RGB -> CHW planar BGR, mean-subtracted, scale 1.0
// (C, H, W are the input dims: 3, 1080, 1920)
for (int i = 0, volImg = C * H * W; i < 1; ++i) {      // batch of 1
    for (int c = 0; c < C; ++c) {                      // destination plane, BGR order
        for (unsigned j = 0, volChl = H * W; j < volChl; ++j) {
            // image.data[j * C + 2 - c] picks the mirrored channel, i.e. RGB -> BGR
            hostDataBuffer[i * volImg + c * volChl + j] =
                (float(image.data[j * C + 2 - c]) - pixelMean[c]) / 1.0F;
        }
    }
}
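
As a cross-check, the same transform can be written with OpenCV primitives (a sketch; image is our 8-bit RGB cv::Mat, and hostDataBuffer, H, W are as above):

#include <opencv2/opencv.hpp>

cv::Mat rgb32f;
image.convertTo(rgb32f, CV_32FC3);   // uint8 -> float, no scaling
cv::Mat chans[3];
cv::split(rgb32f, chans);            // chans[0]=R, chans[1]=G, chans[2]=B
const float mean[3]{103.939F, 116.779F, 123.68F};   // B, G, R means
for (int c = 0; c < 3; ++c)
{
    // Plane c of the planar BGR buffer takes source channel (2 - c).
    cv::Mat plane(H, W, CV_32FC1, hostDataBuffer + c * H * W);
    cv::subtract(chans[2 - c], cv::Scalar(mean[c]), plane);
}

After filling the buffer either way, the rest of the pipeline is unchanged: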

buffers.copyInputToDevice();
bool status = context->execute(1, buffers.getDeviceBindings().data());  // batch size 1
buffers.copyOutputToHost();

const float* nms = static_cast<const float*>(buffers.getHostBuffer("NMS"));
for (int det_id = 0; det_id < 100; det_id++) {
    float x1 = nms[det_id * 7 + 3];  // field 3 of each 7-float detection
}
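
If the NMS plugin follows the layout used in TensorRT's DetectionOutput/NMS samples (an assumption; the TLT docs do not spell it out), each entry is 7 floats, [image_id, label, confidence, x1, y1, x2, y2], and NMS_1 carries the number of kept boxes as an int32. A fuller parsing sketch:

const int keepCount = *static_cast<const int*>(buffers.getHostBuffer("NMS_1"));
for (int det_id = 0; det_id < keepCount; ++det_id)
{
    const float* det = &nms[det_id * 7];
    const int   label = static_cast<int>(det[1]);   // class index
    const float conf  = det[2];                     // confidence score
    // Depending on the plugin config the corners may be normalized to [0, 1]
    // and need scaling by the network input width/height.
    const float x1 = det[3], y1 = det[4], x2 = det[5], y2 = det[6];
}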

I know there are more recent versions of TAO and JetPack, but we have this version deployed at 100+ clients, so updating is not an option right now. We tried training a Faster RCNN with a ResNet50 backbone, but it behaves the same. Mask RCNN with ResNet50 on TLT 3.0 works perfectly.

The documentation about Faster RCNN on TLT is poor and outdated, the NMS tensors are only mentioned in passing, and the TensorRT samples do not cover this.

TensorRT Version: 7.2
GPU Type: Jetson Xavier NX
JetPack: 4.6 (L4T 32.6)
CUDA: 10.2
cuDNN: 8.2.1
Operating System + Version: Ubuntu 18.04 + JetPack

zerollzeng commented 2 years ago

It's a TAO SDK issue. Please ask at https://forums.developer.nvidia.com/c/accelerated-computing/intelligent-video-analytics/tao-toolkit/17.

codesian commented 2 years ago

I asked on the forum, with no response at all. Sorry, but I don't think it's a TAO issue; I think it's related to the NMS plugin. Mask RCNN does not use this plugin and works OK. I tried different backbones (ResNet18, ResNet50) and different models (YOLO, Faster RCNN...) and the output was wrong. Can you confirm that the image preprocessing is correct for this model?

zerollzeng commented 2 years ago

> Can you confirm that the image preprocessing is correct for this model?

I'm not an expert in TAO, but the image preprocessing is not handled by TRT, so there is nothing we can do about the preprocessing :)

ttyio commented 1 year ago

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!