CaoWGG / TensorRT-YOLOv4

tensorrt5, yolov4, yolov3, yolov3-tiny, yolov3-tiny-prn

Batch inference support #34

Open jstumpin opened 4 years ago

jstumpin commented 4 years ago

How do we extend the inference function to support batchSize > 1? For batched inputs, I'm using OpenCV's blobFromImages. It seems to work just fine with batchSize = 1 (using a model/weights built with maxBatchSize of 2). But how do I parse the output? How do I get to the subsequent batchId?

Thanks.

CaoWGG commented 4 years ago

@jstumpin you need to extend doNms and resizeAndNorm to support batch inference.

jstumpin commented 4 years ago

For post-processing I'm using OpenCV's NMS, and for pre-processing I'm using letterboxing from NVIDIA's original YOLO repo. I just couldn't figure out how to offset mCudaBuffers to get to the next batchId, since the number of detections is extracted from mCudaBuffers[1].

CaoWGG commented 4 years ago

@jstumpin you can refer to https://github.com/CaoWGG/TensorRT-YOLOv4/blob/4d7c2edce99e8794a4cb4ea3540d51ce91158a36/onnx-tensorrt/yolo.cu#L52

jstumpin commented 4 years ago

If the yololayer readily supports batching, then how do I get the number of detections (det) for subsequent outputs of the output batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and in the previous line det is extracted from mCudaBuffers[1] with count = sizeof(int).

yiwenwan2008 commented 4 years ago

@jstumpin have you figured it out?

yiwenwan2008 commented 4 years ago

> If the yololayer readily supports batching, then how do I get the number of detections (det) for subsequent outputs of the output batch? Because in yoloNet::Infer, outputData is populated from mCudaBuffers[1] with count = sizeof(float) + int(det) * 7 * sizeof(float), and in the previous line det is extracted from mCudaBuffers[1] with count = sizeof(int).

Did you set batchSize instead of 1 in mContext->execute(batchSize, &mCudaBuffers[0])? If batchSize is not 1, batchId is the key for grouping detections; refer to trt.cpp:L48: int batchId = temp[6];

yiwenwan2008 commented 4 years ago

However, the results are not looking right even though I passed in exactly the same images.

jstumpin commented 4 years ago

@CaoWGG was precisely right to point out data[6] = batch_id; that is literally the solution. There is no need to offset outputData to extract subsequent outputs, everything is already handled in the infer function. So here's what I did @yiwenwan2008:

  1. Convert the Darknet weights into a TensorRT engine via buildEngine with maxBatchSize = 2;
  2. Clone an image and flip it vertically to emulate batchSize = 2;
  3. Letterbox the images via NVIDIA's original YOLO repo;
  4. Convert the vector of images into inputBlob via blobFromImages;
  5. Run inference with batchSize = 2;
  6. Accumulate the output according to int(temp[6]) where float* temp = outputData.get() + 1 (a rough sketch of this step follows the list).
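
For step 6, the grouping boils down to something like this (a minimal sketch, assuming the layout discussed above: one float holding the detection count followed by 7 floats per detection, with the batch id in temp[6]; the remaining field names are my own guesses for illustration):

```cpp
#include <vector>

// Hypothetical detection record; only temp[6] = batch_id is confirmed above,
// the order of the other fields is assumed for illustration.
struct Detection { float x1, y1, x2, y2, score, cls; };

std::vector<std::vector<Detection>> splitByBatch(const float* outputData, int batchSize)
{
    std::vector<std::vector<Detection>> perImage(batchSize);
    const int det = static_cast<int>(outputData[0]);   // detections across the whole batch
    const float* temp = outputData + 1;                 // first 7-float detection record
    for (int i = 0; i < det; ++i, temp += 7)
    {
        const int batchId = static_cast<int>(temp[6]);  // which input image this box belongs to
        if (batchId >= 0 && batchId < batchSize)
            perImage[batchId].push_back({temp[0], temp[1], temp[2], temp[3], temp[4], temp[5]});
    }
    return perImage;
}
```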

Thanks again @CaoWGG for this speedy wrapper (fastest so far on Windows)!

yiwenwan2008 commented 4 years ago

@jstumpin Thank you for your solution. I will try your steps for batch > 1. Right now I am trying to make sure that when batch = 1, I am getting valid results. However, when I use blobFromImages I get different results than with resizeAndNorm(); is there anything I overlooked? As you can see, the bounding box for the dog is not right and the confidence level also changed (the result image above is from when blobFromImages() is used).

```cpp
// NOTE: the blob is CV_32F, so the copy size needs a sizeof(float) factor;
// without it only a quarter of the blob gets uploaded.
cv::Mat blob = cv::dnn::blobFromImages(images, 1.0 / 255.0, cv::Size(inputDim.d[2], inputDim.d[1]),
                                       cv::Scalar(0, 0, 0), true, false, CV_32F);
CUDA_CHECK(cudaMemcpy(mCudaBuffers[0], blob.data, 416 * 416 * 3 * batchSize * sizeof(float),
                      cudaMemcpyHostToDevice));
```

[images: result vs. valid_result]

jstumpin commented 4 years ago

@yiwenwan2008 as mentioned previously, blobFromImages is used for converting the images into the input blob; I'm using it for pre-processing. Anyhow, here's the result:

[image: yoloeddog detection result]

yiwenwan2008 commented 4 years ago

Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

jstumpin commented 4 years ago

@CaoWGG I had to reset mCudaBuffers whenever batchSize is switched (e.g. from 2 to 1) by re-running L161-L174. I don't have to re-initialize anything in the original NVIDIA YOLO repo. Although there isn't any noticeable overhead introduced by this re-initialization, is there anything I can do to simplify things?
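
One idea I'm toying with (just a sketch on my side, not verified against this repo; bindingBytesPerImage is a hypothetical helper returning the per-image byte size of a binding): size every binding once for the engine's maxBatchSize, so switching the runtime batch size never touches the buffers:

```cpp
// Allocate each binding once for maxBatchSize so switching batchSize (2 -> 1 -> 2 ...)
// never requires re-allocating mCudaBuffers.
for (int b = 0; b < mEngine->getNbBindings(); ++b)
{
    size_t perImage = bindingBytesPerImage(mEngine, b);  // hypothetical per-image byte size
    CUDA_CHECK(cudaMalloc(&mCudaBuffers[b], perImage * mEngine->getMaxBatchSize()));
}
// Any batchSize <= maxBatchSize can then go straight to execute():
mContext->execute(batchSize, &mCudaBuffers[0]);
```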

yiwenwan2008 commented 4 years ago

@jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm() and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) have different resize effects on the input images.

yiwenwan2008 commented 4 years ago

@jstumpin since I am also using Darknet to train models, I need to use Darknet's image preprocessing method. Thank you! And thank you for your great work and generous sharing @CaoWGG

jstumpin commented 4 years ago

> @jstumpin I do think the image resize method matters. The reason we are getting different results is that blobFromImages(), resizeAndNorm() and cv::resize(m_OrigImage, m_LetterboxImage, cv::Size(resizeW, resizeH), 0, 0, cv::INTER_CUBIC) have different resize effects on the input images.

I'm sure it does; I never said it doesn't. I was just clarifying that I didn't pass any additional parameters to blobFromImages to bypass its internal pre-processing steps.

> Thanks! @jstumpin, let me check what preprocessing you have done. By the way, have you tried batch > 2? Are you able to get the expected detections? When I tried batch = 4, batch_id was only ever 0 or 1 :(

I did a quick batchSize = 4 conversion and the results add up quite nicely.

yiwenwan2008 commented 4 years ago

@jstumpin it is so good to see that batch inference is working for you. Did you measure the speed gain with batch inference?

yiwenwan2008 commented 4 years ago

@jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4, and why mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only have output for one binding instead of three...

jstumpin commented 4 years ago

> @jstumpin it is so good to see that batch inference is working for you. Did you measure the speed gain with batch inference?

Haven't done the full benchmark yet. I still intend to compare against https://github.com/enazoe/yolo-tensorrt and https://github.com/opencv/opencv/issues/17795#issuecomment-656553410, with the latter looking more promising.

jstumpin commented 4 years ago

> @jstumpin I am still not able to make it work with batchSize > 1. In initEngine(), nbBindings = 4 and mCudaBuffers has size 4; the input is mCudaBuffers[0] and the output is mCudaBuffers[1]. Do you know why mCudaBuffers has size 4, and why mCudaBuffers[2] and mCudaBuffers[3] point to mCudaBuffers[1]? We only have output for one binding instead of three...

mCudaBuffers needs to be of size 4 due to 1 input + 3 YOLO output layers. As for the output having a single binding instead of three (even the original NVIDIA repo does three D2H cudaMemcpy calls), I reckon the author would have the answer.

yiwenwan2008 commented 4 years ago

@jstumpin I was also looking into the OpenCV DNN module, trying to make CUDA work; the FPS using the CPU is quite low, around 1 FPS.

spacewalk01 commented 3 years ago

> @jstumpin you need to extend doNms and resizeAndNorm to support batch inference.

@CaoWGG Hi, thank you for your wonderful implementation. I tried some preprocessing functions using OpenCV dnn. However, I noticed that your resize-and-norm kernel implementation runs much faster than OpenCV dnn. There are two ways to do the computation on the GPU, as you know. I noticed that yours maps 2D images with 3 channels onto a 1D grid, which works wonderfully. However, if I want to implement a preprocessing kernel (resizeAndNorm) for batch data, I wonder which grid layout, 1D or 2D, would be better. I would appreciate your suggestion, thank you.

```
1D -> blockIdx.x | blockIdx.x * blockDim.x + threadIdx.x
2D -> blockIdx.x | blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x
```
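
For concreteness, what I have in mind for the flat 1D-grid case is roughly this (only a sketch of the indexing, not the repo's kernel; nearest-neighbour sampling is used just to keep it short):

```cpp
// One thread per output pixel across the whole batch, flat 1D grid.
// Launch with e.g. <<<(batchSize * dstH * dstW + 255) / 256, 256>>>.
__global__ void resizeNormBatch(const unsigned char* src, int srcW, int srcH,
                                float* dst, int dstW, int dstH, int batchSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = batchSize * dstH * dstW;
    if (idx >= total) return;

    int b = idx / (dstW * dstH);          // image in the batch
    int y = (idx / dstW) % dstH;          // output row
    int x = idx % dstW;                   // output column

    int sx = x * srcW / dstW;             // nearest-neighbour source pixel
    int sy = y * srcH / dstH;
    const unsigned char* pix = src + ((b * srcH + sy) * srcW + sx) * 3;

    // write planar CHW output, normalized to [0, 1]
    float* out = dst + b * 3 * dstH * dstW;
    for (int c = 0; c < 3; ++c)
        out[c * dstH * dstW + y * dstW + x] = pix[c] / 255.f;
}
```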

Also, I noticed that doNms does not use a GPU kernel, and I would like to know the reason. Why not use the GPU for post-processing?

jstumpin commented 3 years ago

@yiwenwan2008 @batselem Perhaps you can consider NVIDIA's official support for YOLOv4: https://github.com/NVIDIA-AI-IOT/yolov4_deepstream (it includes GPU-based post-processing via the batchedNMSPlugin); a benchmark can be found here.

spacewalk01 commented 3 years ago

@jstumpin thank you for your suggestion. I will try it.

spacewalk01 commented 3 years ago

@jstumpin I found out that in the implementation you suggested, the author uses CPU-side resizing (cv::resize with cv::Size). I tried this approach before with both cv::cuda::resize and cv::resize; they were both very slow.


```cpp
if (this->mParams.cocoTest)
{
    for (int b = 0; b < inputB; ++b)
    {
        if (this->mImageIdx + b < this->mImageFiles.size())
        {
            cv::Mat test_img = cv::imread(this->mImageFiles[this->mImageIdx + b]);
            cv::Mat rgb_img;
            cv::cvtColor(test_img, rgb_img, cv::COLOR_BGR2RGB);
            cv::Mat pad_dst;
            cv::Scalar value(0, 0, 0);
            auto scaleSize = cv::Size(inputW, inputH);
            // ... (the snippet continues with CPU-side letterboxing)
        }
    }
}
```

jstumpin commented 3 years ago

@batselem For the given test.jpg (4134x1653) example here, cv::cuda::resize with the typical memory copy gives me 0.452 ms on average, versus 1.631 ms for cv::resize. The key to speed is to minimize overheads, namely H2D/D2H copies. Thus, even if GPU resize were only on par with the CPU's, the overall latency would normally favor the former as long as we keep processing persistently on one side of the hardware pipeline; e.g. for the said benchmark, cv::cudacodec::createVideoReader is used in lieu of cv::VideoCapture.
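
For reference, the persistent-GPU path I mean looks roughly like this (a sketch; it assumes OpenCV built with CUDA and NVCUVID, and "video.mp4" is just a placeholder path):

```cpp
#include <opencv2/cudacodec.hpp>
#include <opencv2/cudawarping.hpp>

int main()
{
    // Decode and resize entirely on the GPU so frames never round-trip to the host
    // before being handed to the TensorRT input buffer.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader(std::string("video.mp4"));  // placeholder path
    cv::cuda::GpuMat frame, resized;
    while (reader->nextFrame(frame))
    {
        cv::cuda::resize(frame, resized, cv::Size(416, 416));  // stays on the device
        // ... normalize and copy device-to-device into the engine's input buffer
    }
    return 0;
}
```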

spacewalk01 commented 3 years ago

Thanks, @jstumpin, I will consider your suggestion!