PiLab-CAU / ComputerVision-2401

Computer Vision Course 2024-01
Apache License 2.0

[lecture13][0612] Inquiry on the Specific Limitations of YOLO-v1 Model #42

Closed zxcv3296 closed 3 months ago

zxcv3296 commented 3 months ago

I am writing to inquire about the specific limitations of the YOLO-v1 model as discussed in our recent lecture. YOLO-v1, while being an innovative and efficient object detection model, is known to have several limitations that impact its performance. I would like to understand these limitations better and verify their validity.

Could you please elaborate on the following points regarding YOLO-v1's limitations?

  1. Detection of Multiple Objects in a Single Grid Cell: It has been noted that YOLO-v1 struggles to detect multiple objects within a single grid cell. How does this limitation affect the model’s performance in dense object scenarios?

  2. Handling of Small Objects: The model reportedly has difficulties with small object detection due to its grid cell approach favoring larger objects. What are the specific challenges YOLO-v1 faces with small objects, and are there any particular cases where this limitation is most evident?

  3. Bounding Box Regression Issues: YOLO-v1’s bounding box predictions can sometimes be inaccurate, leading to poor localization. How significant is this issue in practical applications, and are there known methods to mitigate it?

I would appreciate a detailed explanation of these points to understand the limitations of YOLO-v1 better. Additionally, if there are any insights or counterarguments that might provide a more balanced view, I would be very interested in hearing them.

Thank you.

yjyoo3312 commented 3 months ago

@zxcv3296 Thank you for the comment!

  1. By default, YOLO v1 divides the input image into a 7x7 grid and assigns at most one object to each grid cell, so it can detect at most 49 objects in a scene. This architecture makes YOLOv1 incapable of handling dense-object scenarios. It's worth noting that, at the time, other real-time detectors such as DPM also struggled with dense-object scenes.

  2. YOLO v1 operates on a 448x448 input, so each of the 49 grid cells covers a 64x64 region. Since YOLOv1 represents the bounding box center relative to its grid cell and the width/height relative to the image, all in the range 0 to 1, it is difficult for the model to localize objects smaller than a grid cell.
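To make the small-object issue concrete, here is a minimal sketch (my own, not code from the lecture; the helper name is hypothetical) of how YOLOv1-style regression targets are parameterized:

```python
S, IMG = 7, 448  # YOLOv1's grid size and input resolution

def encode_box(cx, cy, w, h):
    """Encode a pixel-space box roughly the way YOLOv1 parameterizes targets
    (simplified sketch): (x, y) = center offset within its grid cell in [0, 1),
    (w, h) = box size as a fraction of the whole image in (0, 1]."""
    col, row = int(cx * S / IMG), int(cy * S / IMG)
    x = cx * S / IMG - col   # offset of the center inside its cell
    y = cy * S / IMG - row
    return (row, col), (x, y, w / IMG, h / IMG)

# A 16x16 object spans only a quarter of one 64x64 cell's side; its
# width/height targets are about 0.036, so even a small regression error
# is large relative to the object itself.
cell, target = encode_box(100, 100, 16, 16)
print(cell, target)
```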

Those limitations of YOLO v1 are critical for dense object detection, such as in the example below.

[image: dense object detection example]

So, many face detectors follow the SSD architecture instead of YOLO's (e.g., S3FD: Single Shot Scale-invariant Face Detector).
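As a toy illustration of point 1 (again my own sketch, not the actual training code), this shows how the one-object-per-cell assignment forces two nearby object centers to collide:

```python
S, IMG = 7, 448   # 7x7 grid over a 448x448 input -> 64x64 px per cell

def cell_of(cx, cy):
    """Grid cell (row, col) that a box center in pixel coordinates falls into."""
    return int(cy * S / IMG), int(cx * S / IMG)

# Two object centers roughly 22 px apart land in the same cell, so one
# ground-truth object has nowhere to go when building training targets.
centers = {"obj_a": (100, 100), "obj_b": (120, 110)}
assigned, dropped = {}, []
for name, (cx, cy) in centers.items():
    cell = cell_of(cx, cy)
    if cell in assigned:
        dropped.append(name)      # collides with an earlier object
    else:
        assigned[cell] = name

print(assigned, dropped)
```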

  3. Failure in bounding box regression leads to two main issues: (1) Qualitatively, the predicted box does not tightly capture the object region, reducing the usefulness of the prediction. (2) Quantitatively, a loosely captured bounding box has a low IoU (Intersection over Union) with the ground truth. In the mAP metric, a predicted box whose IoU falls below the matching threshold is counted as a false positive, which lowers the mAP score.
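A short sketch of the quantitative side of point 3: a standard IoU computation showing how a loosely regressed box falls below the common 0.5 matching threshold and is therefore scored as a false positive (box values are made up for illustration):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    return inter / (area(a) + area(b) - inter)

gt    = (100, 100, 200, 200)   # ground-truth box
loose = (80, 80, 240, 240)     # loosely regressed prediction around it

score = iou(gt, loose)
print(round(score, 3))  # 0.391
# Under the common 0.5 IoU criterion, the detection counts as a
# false positive even though the detector "found" the object:
print("false positive" if score < 0.5 else "true positive")
```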

(would you let me know your name? It's for recording the scores.)