AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Nested object detection #2965

Open pkhigh opened 5 years ago

pkhigh commented 5 years ago

I am working on a problem where I have to detect an object and then detect sub-parts of that object. For example: finding the face in an image and then finding the eyes in the face. When I train 2 different models, one for detecting the face only and the other for detecting the eyes in the face, I get detection confidence of 0.9+ in all cases. But when I try to detect both the face and the eyes in the image with a single model, I cannot get a confidence score greater than 0.8 on various images. I believe the problem is due to the fact that all the major object detection algorithms treat each object as an independent entity. Therefore the model assumes the eyes create a kind of occlusion on the face, which is why the confidence is lower. Is there a way I can train a model that learns the inherent relations between the presence of different objects and sub-objects?

AlexeyAB commented 5 years ago

@pkhigh

Do you care about detection accuracy (mAP, TP/FP/FN, P-R, ...) or only a high degree of confidence?


When I train 2 different models, one for detecting the face only and the other for detecting the eyes in the face, I get detection confidence of 0.9+ in all cases.

Do you use yolov3.cfg in both cases?

Do you crop the face from the original image after finding the face and before finding the eyes?

If yes, then the face cropped from the original image has more information (pixels) than the face on the image that is resized to the network size.

How YOLO sees the face:

Cropped face from the original image: [image: face_from_original]

Face from the image resized to the network size: [image: face_from_network]

How YOLO sees the whole image: [image: resized_to_network_size]

Original image: [image: original]
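The two-stage approach discussed here (detect the outer object, crop it from the full-resolution original, then run the sub-object detector on the crop) could be sketched as below. `detect_outer` and `detect_inner` are hypothetical stand-ins for the two trained single-class models; they are not darknet API calls.

```python
# Hypothetical two-stage pipeline: detect the outer object first,
# then crop it from the full-resolution original and run the
# sub-object detector on the crop. The crop keeps far more pixels
# than the same object inside a whole image resized to 416x416.

def crop(image, box):
    """Crop an (x, y, w, h) box from an image given as nested lists of pixels."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def two_stage_detect(image, detect_outer, detect_inner):
    """Run detect_outer on the full image, then detect_inner on each crop.

    detect_outer/detect_inner are placeholders for the two single-class
    models discussed above; each returns a list of (x, y, w, h) boxes
    in pixel coordinates of the image it was given.
    """
    results = []
    for ox, oy, ow, oh in detect_outer(image):
        sub = []
        for ix, iy, iw, ih in detect_inner(crop(image, (ox, oy, ow, oh))):
            # Shift inner boxes back into full-image coordinates.
            sub.append((ox + ix, oy + iy, iw, ih))
        results.append(((ox, oy, ow, oh), sub))
    return results
```

The coordinate shift at the end is the easy-to-forget step: the inner model reports boxes relative to the crop, not to the original image.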

pkhigh commented 5 years ago

Hi @AlexeyAB, yes, I do care about all the accuracy metrics. But my point is that all the architectures we use today, be it YOLO, SSD, or Faster R-CNN, try to find different objects independently. In an image of a face, the eyes are actually treated as separate, independent objects. The problem is that with nested objects, although the sub-object is part of the bigger object, it causes occlusion for the bigger object, which reduces the score of the bigger object when the smaller object is overlaid on it. This should not happen in a real scenario, because the eyes and the face are dependent objects. I am looking for a solution where I can tell the network not to treat objects as independent entities.

pkhigh commented 5 years ago

Actually, I am working on a dataset that is not about faces; I just gave that as an example.

My actual task is finding a phone in an image and then finding any cracks on the phone.

pkhigh commented 5 years ago

You can see the actual problem in the images below:

  1. Image with no damage.
  2. Image with small damage.
  3. Image with big damage.

Notice that the confidence score of the phone decreases gradually. [annotated images: DSCF6493_0.jpg, old.jpg, random.jpg]

AlexeyAB commented 5 years ago

But my point is that all the architectures we use today, be it YOLO, SSD, or Faster R-CNN, try to find different objects independently. ... I am looking for a solution where I can tell the network not to treat objects as independent entities.

So SSD, DSSD, and YOLO already do what you want out of the box.


If you want to achieve higher accuracy (mAP/F1-score/TP/...), then you should compare those metrics rather than comparing confidence.


When I train 2 different models, one for detecting the face only and the other for detecting the eyes in the face, I get detection confidence of 0.9+ in all cases.

If you crop the object (to find sub-objects) from the original image, then you simply get an image with higher resolution, which can lead to higher accuracy.
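The resolution argument can be made concrete with some back-of-the-envelope arithmetic (the image and face sizes below are made up for illustration, not from this thread):

```python
# Rough pixel-budget comparison for cropping vs. whole-image resize.
# All sizes are illustrative assumptions.
NET = 416            # network input width (the yolov3.cfg default)
IMG_W = 3000         # hypothetical original image width in pixels
FACE_W = 600         # hypothetical face width in the original image

# Whole image resized to the network input: the face shrinks in proportion.
face_in_resized = FACE_W * NET / IMG_W   # ~83 px wide inside the network

# Face cropped first, then the crop itself resized to the network input:
# the face now spans (almost) the entire 416 px.
face_in_crop = NET

print(round(face_in_resized), face_in_crop)  # 83 416
```

So the second-stage model sees roughly 5x more pixels across the object, which is the extra information AlexeyAB refers to.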

pkhigh commented 5 years ago

I think my point is still not clear.

Do you see the gradual decrease in the confidence score of the phone as the area of the damage increases?

The point is: I have trained a single model with 2 classes, phone and damage. The single model causes this issue.

But when I train 2 separate models with 1 class each, i.e. phone and damage, the score of the phone doesn't decrease no matter what the size of the damage is.

This happens because in a single model the damage is treated independently, and a bigger damage on the phone causes the actual detection of the phone to be reduced, which should not happen in a real case.

AlexeyAB commented 5 years ago

You didn't answer: did you crop part of the image or not for the 2 models?

Do you see the gradual decrease in the confidence score of the phone as the area of the damage increases?

The reason: both objects, phone and damage, try to occupy the same final activation (same cell and anchor), because they have approximately the same size and location. So one of the objects has to use a less suitable anchor, and this gives lower confidence.

Maybe the confidence score decreases, but the accuracy increases?
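The collision described above can be illustrated with the usual YOLO ground-truth assignment rule: the box centre picks the grid cell, and the best shape-IoU picks the anchor. The anchor and box sizes below are hypothetical numbers, chosen only to show two nearly coincident objects competing for one slot:

```python
# Sketch of YOLO's ground-truth assignment: each labelled box is assigned
# to one grid cell (by its centre) and one anchor (by best shape IoU).
# When "phone" and "damage" have nearly the same centre and size, they
# compete for the same (cell, anchor) slot. All numbers are illustrative.

GRID = 13  # e.g. a 13x13 output grid for a 416x416 input

def cell_of(cx, cy):
    """Grid cell that a normalised box centre (0..1) falls into."""
    return (int(cx * GRID), int(cy * GRID))

def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes compared by shape only (centres aligned)."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def best_anchor(w, h, anchors):
    """Index of the anchor whose shape best matches the box."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(w, h, *anchors[i]))

anchors = [(0.1, 0.15), (0.3, 0.4), (0.8, 0.9)]  # normalised (w, h)

phone  = (0.50, 0.50, 0.60, 0.80)  # (cx, cy, w, h): the big object
damage = (0.52, 0.48, 0.55, 0.70)  # almost the same centre and size

for cx, cy, w, h in (phone, damage):
    print(cell_of(cx, cy), best_anchor(w, h, anchors))
# Both boxes land in cell (6, 6) and prefer anchor index 2, so during
# training one of them must fall back to a less suitable slot.
```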

pkhigh commented 5 years ago

No, I do not crop anything when I train the models separately. I only tweak my labels; there are no edits to the input image, in both cases.

No, even the accuracy isn't increased. I want to run these detections at high confidence. I am also looking for a robust solution where someone has actually solved this occlusion problem, because, logically, damage will only occur on a phone; therefore the presence of damage must not change the confidence of the phone (in a real scenario).

pkhigh commented 5 years ago

@AlexeyAB so if I break the damage box into smaller boxes, their centres will shift and will not occupy the cell where the centre of the phone falls?

Are there any other ways to tackle this problem?

There are cases where the confidence of both gets very low: [annotated image: BC_16_damaged 2.jpg]
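The box-splitting idea raised in the question above could look like the following sketch, which tiles one YOLO-format label (`class cx cy w h`, all normalised to 0..1) into smaller labels with shifted centres. This is purely illustrative; whether it actually helps depends on the data and, as noted below, it is at best a partial fix.

```python
# Sketch of the "break the damage box into smaller boxes" idea:
# split one YOLO-format label (class cx cy w h, normalised 0..1)
# into an n x n grid of tiles whose centres fall in different cells.

def split_box(cls, cx, cy, w, h, n=2):
    """Return n*n YOLO-format tile labels covering the original box."""
    tiles = []
    tw, th = w / n, h / n            # tile width and height
    x0, y0 = cx - w / 2, cy - h / 2  # top-left corner of the original box
    for i in range(n):
        for j in range(n):
            tiles.append((cls,
                          x0 + (i + 0.5) * tw,  # tile centre x
                          y0 + (j + 0.5) * th,  # tile centre y
                          tw, th))
    return tiles

# One big "damage" box centred on the phone becomes 4 smaller boxes
# whose centres no longer coincide with the phone's centre.
for t in split_box(1, 0.5, 0.5, 0.6, 0.8):
    print("%d %.3f %.3f %.3f %.3f" % t)
```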

AlexeyAB commented 5 years ago

so if I break the damage box into smaller boxes, their centres will shift and will not occupy the cell where the centre of the phone falls?

It will help a little. But there is a better solution: multi-label classification, as in YOLOv3.
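The difference being pointed to: YOLOv2 scores classes with a softmax, so classes compete for probability mass, while YOLOv3 uses an independent logistic (sigmoid) per class, so one predicted box can score high for both "phone" and "damage" at once. A minimal numeric sketch (the logit values are made up):

```python
import math

# Softmax (YOLOv2-style class scoring) vs. independent per-class
# sigmoids (YOLOv3-style) applied to the same class logits.
# Illustrative logits; equally strong evidence for both classes.

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 2.0]  # classes: phone, damage

print(softmax(logits))               # [0.5, 0.5]  - forced to split
print([sigmoid(x) for x in logits])  # [~0.88, ~0.88] - both can be high
```

With softmax, "phone" and "damage" can never both be near 1.0 for the same box; with independent sigmoids they can, which is why multi-label classification suits nested objects better.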

Are there any other ways to tackle this problem?

pkhigh commented 5 years ago

Earlier I was using YOLOv3 for the detections, but I wasn't getting good results, probably because YOLOv3 is not good at localising big objects in images. The pictures above are from RetinaNet. But RetinaNet also uses sigmoid for classification, which means it uses multi-label classification, so I don't think that matters. However, I will post the pictures from YOLOv3 as well, along with the cfg file for reference.

AlexeyAB commented 5 years ago

@pkhigh Hi, did you get any results on Yolo v3 with nested objects?

pkhigh commented 5 years ago

@AlexeyAB I am sticking with two separate models for now: two separate YOLOv2 models, one for the phone and another for the damage. I tried training nested objects, i.e. a combined model with 2 classes, but the results were poor, be it YOLOv2, YOLOv3, or RetinaNet. Mostly the results get bad when the size of the damage object increases relative to the size of the phone, which seems like an occlusion problem.

AlexeyAB commented 5 years ago

@pkhigh

Mostly the results get bad when the size of the damage object increases relative to the size of the phone, which seems like an occlusion problem.

Can you show a pair of such examples with bad detections using YOLOv3?

pkhigh commented 5 years ago

I have deleted the trained v3 models. I will re-train and post a few results, along with the cfg file.

teshanshanuka commented 3 years ago

@pkhigh I am facing a similar problem, so I am interested in this issue. Is there any update you can share here? What was your solution in the end?