DanaHan / Yolov5-in-Deepstream-5.0

Describe how to use yolov5 in Deepstream 5.0

problem when BATCH_SIZE > 1 #12

Closed rho-sk closed 3 years ago

rho-sk commented 3 years ago

I am trying to make it work with batch size > 1. The version I am working with is yolov5 3.1 with 23 classes at 640x640; the device is a Jetson Nano.

Tensorrtx tests. Because 3.1 changed a little, I took the current version of tensorrtx without hswish.

  1. tensorrtx was built with 23 classes
  2. when tested with batch size 1, tensorrtx works fine
  3. I rebuilt tensorrtx with batch size 8 and regenerated the engine file with max batch size = 8
  4. tested with tensorrtx inference ( yolov5 -d ../samples ) - results are OK ( 23 classes, batch size 8, 640x640 )
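Step 3 works because in tensorrtx the batch size is a compile-time constant and the engine's max batch size is passed to the TensorRT builder. A rough sketch, assuming the structure of tensorrtx's yolov5.cpp (setMaxBatchSize is real TensorRT 7 API; the surrounding code is illustrative, not the exact file contents):

```cpp
// Sketch only: regenerating the engine with a larger max batch size.
#define BATCH_SIZE 8   // compile-time constant in tensorrtx; rebuild before serializing

IBuilder* builder = createInferBuilder(gLogger);
builder->setMaxBatchSize(BATCH_SIZE);  // engine then accepts any batch size <= 8
// ... build the network, serialize, and write yolov5s.engine ...
```

An engine serialized this way can later be run at any batch size up to 8, which is why step 4 above can test the full batch.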

So on the Jetson Nano with JETSON_CUDA=10.2.89 it works fine.

Deepstream 5.0 nvdsinfer_custom_impl_Yolo. On the Jetson Nano with JETSON_CUDA=10.2.89 I configured yolov5s for DeepStream 5.0:

  1. When tested with the batch size 1 engine, it works, but occasionally the "boxes explode". It looks like memory is not cleared between cycles (just my assumption).

  2. When tested with the max batch size = 8 engine, the "boxes explode" very often, even with batch size set to 1. Tracking has no effect on this behavior; it was turned off in the test.

  3. With the batch size = 1 engine there is also strange behavior with two streams in parallel in DeepStream: the first stream is correct, the second one shows the "exploded boxes" behavior. So I think that problem is related to this one too.
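For context, the batch size also has to be set consistently on the DeepStream side. A sketch of the relevant Gst-nvinfer config keys (the keys are standard DeepStream 5.0 properties; the file name and values are assumptions for this particular setup):

```ini
; config_infer_primary_yoloV5.txt (assumed name) - excerpt
[property]
model-engine-file=yolov5s.engine   ; engine generated with max batch size 8
batch-size=8                       ; must not exceed the engine's max batch size
num-detected-classes=23
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
parse-bbox-func-name=NvDsInferParseCustomYoloV5
```

The [streammux] batch-size in the deepstream-app config should also match the number of input sources.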

Example of how the explosion looks: [screenshot: explode]

Expected output (sometimes correct): [screenshot: sometimes ok]

rho-sk commented 3 years ago

I was able to fix it. The problem is in tensorrtx's yololayer.cu (the current version as of 16.11.2020, adapted to work with yolov5 3.1).

The important part is to keep the stream context and do the memset asynchronously, line 222:

            CUDA_CHECK(cudaMemsetAsync(output + idx*outputElem, 0, sizeof(float), stream));

Line 233 (again, the important part is to keep the stream context):

            CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>
                (inputs[i], output, numElem, mYoloV5NetWidth,  mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount, outputElem);

Result: works like a charm now (batch size 2 in this example).

resolved