KoKoLates / snpe-yolov7-inference

YOLOv7-tiny model inference on Qualcomm SNPE for pedestrian detection on embedded systems.
https://github.com/KoKoLates/snpe-yolov7-inference/wiki
MIT License

The output result when running the yolov7 converted DLC shows many bounding boxes. #2

Closed HuiJu1218 closed 2 months ago

HuiJu1218 commented 2 months ago

Hello, I used the official yolov7 pre-trained model 'yolov7.pt' and followed the instructions in the wiki to convert it into ONNX, and then into DLC. I'm using Qualcomm SDK version 2.22.6.

Additionally, I directly compiled and executed the application in a Docker environment on x86. The execution was successful, but the post-processing produces many bounding boxes, which suggests an issue with NMS.

I suspect that there might be an issue with my conversion process because I didn't encounter the warnings mentioned in the wiki when converting the official pre-trained model. Is there any way to help me verify this? Or could you provide the converted DLC for me to try?

Thanks.

KoKoLates commented 2 months ago

Hi, can you provide some output images or screenshots?

HuiJu1218 commented 2 months ago

Below are the resources I used and the output results.

KoKoLates commented 2 months ago

Hi,

I think there are a few adjustments you could try. First, you can modify the class-confidence threshold or the NMS threshold at line#101 and line#129 of the object detection file. Observe whether these changes lead to any improvement, although I don't believe this will make a significant difference.

I suspect the main issue might lie in the postprocess() function. The code I provided is primarily for yolov7-tiny, so if you're working with yolov7, you'll need to modify the code to align with that model's architecture.

  1. The sizes of the feature maps
    float size[3] = {80, 40, 20};
  2. Strides
    float strides[3] = {8, 16, 32};
  3. The grid of anchors
    float anchorGrid[][6] = {
    { 12, 16, 19, 36, 40, 28},
    { 36, 75, 76, 55, 72,146},
    {142,110,192,243,459,401}
    };
  4. Index shifting in line#74
    if (i == 1) {
    index += 19200;
    } else if (i == 2) {
    index += 24000;
    }

I'm not very sure if the values I provided above are all correct, so it would be a good idea to double-check them before compilation. Also, make sure you've trimmed the model when converting it to *.dlc format, as SNPE cannot handle 5-dimensional data types (this usually occurs before the Reshape layer).

The per-anchor channel size is set to 28 because I trained my model with only 23 classes, plus the bounding box information (x, y, height and width) and the confidence score.

HuiJu1218 commented 2 months ago

Hi, I successfully used the DLC for inference following your advice, thanks a lot. Here is how I made the adjustments.

  1. Change the size, stride, and anchor values to YOLOv7's
    float size[3] = {20, 80, 40};
    float strides[3] = {32, 8, 16};
    float anchorGrid[][6] = {
        {142,110,192,243,459,401},
        { 12, 16, 19, 36, 40, 28},
        { 36, 75, 76, 55, 72,146}
    };
  2. Modify the index offsets to 6000 and 19200
    if (i == 1) { 
      index += 6000; 
    } else if (i == 2) {
      index += 19200;
    }
  3. Change 28 (23 + 5) to 85 (80 + 5)

But I still have a few questions. Thanks.

  1. I would like to understand the calculation of the index. From what I know, it seems to be related to the three feature map sizes in YOLOv7, which are 20x20, 40x40, and 80x80. However, I'm unable to deduce how to handle the index when i=1 and i=2.
  2. Why is it necessary to first filter by confidence#101 and then filter the results again by score#116?
  3. Lastly, I'm currently running inference on x86 using the DLC in float32, but the recognition accuracy is noticeably worse compared to before the conversion. Do you have any further suggestions or recommendations?

KoKoLates commented 2 months ago

I would like to understand the calculation of the index. From what I know, it seems to be related to the three feature map sizes in YOLOv7, which are 20x20, 40x40, and 80x80. However, I'm unable to deduce how to handle the index when i=1 and i=2.

In your case, i==0 corresponds to the 20x20 feature map, i==1 to the 80x80 feature map, and i==2 to the 40x40 feature map. Thus, to my understanding,

if (i==1) {
  index += 1200;   // 20x20x3
} else if (i==2) {
  index += 20400; // 20x20x3+80x80x3
}

This differs somewhat from your values.

Why is it necessary to first filter by confidence#101 and then filter the results again by score#116?

My purpose here is to quickly eliminate boxes with low confidence scores so that only the more promising candidates go through the per-class score calculation. This avoids computing classification scores for every candidate box, which cuts unnecessary work and can significantly improve efficiency, especially when there is a very large number of candidate boxes.

Lastly, I'm currently running inference on x86 using the DLC in float32, but the recognition accuracy is noticeably worse compared to before the conversion

I don't quite understand what you mean here.

HuiJu1218 commented 2 months ago

Hi! Thank you for the response. After our discussion, I found another way. In the end, I used the command python export.py --weights yolov7.pt --grid --simplify --img-size 640 640 to convert the model to ONNX during the conversion process. Then I performed the post-processing, and the results were quite good. Thank you!