Open pkhigh opened 5 years ago
@pkhigh
Do you care about detection accuracy (mAP, TP/FP/FN, P-R, ...) or only a high degree of confidence?
When I train 2 different models, 1 for detecting the face only and the other for detecting eyes in the face, I get a detection confidence of 0.9+ in all cases.
Do you use yolov3.cfg in both cases?
Do you crop the face from original image after finding face and before finding eyes?
If yes, then the face cropped from the original image has more information (pixels) than the face in the image that is resized to the network size.
How Yolo sees the face
Cropped face from original image:
Face from image that is resized to network size:
How Yolo sees the whole image:
original image:
Hi @AlexeyAB, yes, I do care about all the accuracy metrics. But my point is that all the architectures we use today, be it YOLO, SSD, or Faster R-CNN, try to find different objects independently. I mean, in an image of a face, the eyes are actually treated as separate, independent objects. The problem is that with nested objects, although the sub-object is part of the bigger object, it causes occlusion for the bigger object, which reduces the score of the bigger object when the smaller object is overlaid on it. This should not happen in a real scenario, because the eyes and the face are dependent objects. I am looking for a solution where I can tell the network not to treat objects as independent entities.
Actually, I am working on a dataset that is not about faces. I just gave it as an example.
My actual dataset is about finding a phone in an image and then, within the phone, finding any cracks.
You can see the actual problem in the images below:
See that the score of the phone decreases gradually.
But my point is that all the architectures we use today, be it YOLO, SSD, or Faster R-CNN, try to find different objects independently. ... I am looking for a solution where I can tell the network not to treat objects as independent entities.
Multi-stage detectors (Fast R-CNN, Faster R-CNN, ...): these see the parts of the image inside the Regions produced by the Region Proposal Network (RPN), but a Region can be larger than the object, so the detector sees and takes into account more than just the object (i.e. when it detects an eye, it can see the whole face and take it into account).
Single-shot detectors (SSD, DSSD, Yolo, ...): each final cell usually sees the whole image, so it sees and takes into account more than just the object (i.e. when it detects an eye, it sees the whole face and takes it into account).
So SSD, DSSD, and Yolo do what you want out of the box.
If you want to achieve higher accuracy (mAP / F1-score / TP / ...), then you should compare those metrics rather than comparing confidence.
When I train 2 different models, 1 for detecting the face only and the other for detecting eyes in the face, I get a detection confidence of 0.9+ in all cases.
If you crop the object (to find sub-objects) from the original image, then you simply get an image with higher resolution - this can lead to higher accuracy.
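A minimal sketch of why cropping helps, with purely illustrative image and face sizes (not from this thread): when the whole image is resized to the network input, a small object keeps only a fraction of its original pixels, whereas a crop resized to the network input keeps far more detail.

```python
# Hypothetical numbers for illustration: a 200x200 px face inside a
# 2000x1500 px photo, with a YOLO-style network input of 416x416.
orig_w, orig_h = 2000, 1500
face_w, face_h = 200, 200
net_size = 416

# Case 1: the whole image is resized to the network size (preserving
# aspect ratio) -- the face shrinks by the same scale factor.
scale = min(net_size / orig_w, net_size / orig_h)
face_in_resized = (face_w * scale, face_h * scale)
print(face_in_resized)  # (41.6, 41.6) -> only ~42x42 px of face detail

# Case 2: the face is cropped first, then the crop is resized to the
# network size -- the face fills the whole 416x416 input.
crop_scale = net_size / max(face_w, face_h)
face_in_crop = (face_w * crop_scale, face_h * crop_scale)
print(face_in_crop)  # (416.0, 416.0) -> ~100x more pixels of detail
```

With these numbers the cropped face reaches the network with roughly 100 times more pixels than the face inside the resized full image, which is the extra information AlexeyAB refers to.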
My point is still not clear, I think.
Do you see the gradual decrease in the confidence score of the phone as the area of the damage increases?
The point is: I have trained a single model with 2 classes, phone and damage. The single model causes this issue.
But when I train 2 separate models with 1 class each, i.e. phone and damage, the score of the phone doesn't decrease no matter what the size of the damage is.
This happens because in a single model the damage is treated independently, and a bigger damage on the phone causes the actual detection of the phone to be reduced, which in a real case should not happen.
You didn't answer: did you crop part of the image or not for the 2 models?
Do you see the gradual decrease in the confidence score of the phone as the area of the damage increases?
The reason: both objects, phone and damage, try to occupy the same final activation (same cell and anchor), because they have approximately the same size and location. So one of the objects has to use a less suitable anchor - this gives lower confidence.
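A simplified sketch of the ground-truth assignment AlexeyAB describes, with made-up grid size, anchors, and box coordinates: each object is assigned to the grid cell containing its box center and to the anchor with the highest width/height IoU. When the phone and a large damage region have nearly the same center and size, they compete for the same cell and the same best anchor, so one of them is pushed to a worse slot.

```python
# Illustrative sketch of YOLO-style ground-truth assignment (simplified).
# Grid size, anchors, and boxes are hypothetical, not from a real cfg.
def iou_wh(w1, h1, w2, h2):
    """IoU of two boxes centered at the same point (width/height only)."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def assign(cx, cy, w, h, grid=13,
           anchors=((0.1, 0.2), (0.4, 0.4), (0.9, 0.8))):
    """Return (grid cell, best anchor index) for a normalized box."""
    cell = (int(cx * grid), int(cy * grid))
    best = max(range(len(anchors)),
               key=lambda i: iou_wh(w, h, *anchors[i]))
    return cell, best

# A phone and a large damage region covering most of the phone:
phone  = assign(0.50, 0.50, 0.60, 0.80)
damage = assign(0.52, 0.51, 0.55, 0.70)
print(phone, damage)  # ((6, 6), 2) ((6, 6), 2) -- same cell, same anchor
```

Both objects map to cell (6, 6) with anchor 2, but only one ground truth can occupy that (cell, anchor) slot, so the other must be trained against a less suitable anchor, which lowers its confidence.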
Maybe the confidence score decreases, but accuracy increases?
No, I do not crop anything when I train the models separately. I only tweak my labels; there are no edits to the input image, in both cases.
No, even accuracy isn't increased. I want to run these detections at high confidence. I also want a robust solution where someone has actually solved this problem of occlusion, because, logically, damage will only occur on a phone; therefore the presence of damage must not change the confidence of the phone (in a real scenario).
@AlexeyAB so if I break the damage box into smaller boxes, then their centres will shift and they will not occupy the cell where the centre of the phone falls?
Are there any other ways to tackle this problem?
There are cases where the confidence of both gets very low:
so if I break the damage box into smaller boxes, then their centres will shift and they will not occupy the cell where the centre of the phone falls?
It will help a little. But there is a better solution - multi-label classification as in Yolo v3.
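A small sketch of the difference, with illustrative logit values: a softmax over classes (as in Yolo v2) forces class probabilities to compete and sum to 1, while independent sigmoids (the multi-label classification of Yolo v3) let a single predicted box score high for both "phone" and "damage" at once.

```python
import math

# Hypothetical raw class scores (logits) for one predicted box:
logits = {"phone": 2.0, "damage": 1.5}

# Softmax: classes are mutually exclusive; probabilities sum to 1,
# so a strong "damage" score suppresses the "phone" probability.
exps = {k: math.exp(v) for k, v in logits.items()}
total = sum(exps.values())
softmax = {k: v / total for k, v in exps.items()}

# Independent sigmoids: each class is a separate yes/no decision,
# so both classes can be confidently present at the same time.
sigmoid = {k: 1 / (1 + math.exp(-v)) for k, v in logits.items()}

print(softmax)  # phone ~0.62, damage ~0.38 -- they suppress each other
print(sigmoid)  # phone ~0.88, damage ~0.82 -- both can be high at once
```

This is why multi-label classification suits nested objects: "phone" and "damage" no longer have to split a probability budget between them.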
Are there any other ways to tackle this problem?
yolov3.cfg
Earlier I was using YOLOv3 for the detections, but I wasn't getting good results - probably also because YOLOv3 is not good at localising big objects in images. The pictures above are from RetinaNet. But RetinaNet also uses sigmoid for classification, which means it uses multi-label classification, so I don't think that matters. However, I will post the pictures from YOLOv3 as well, and the cfg file for reference.
@pkhigh Hi, did you get any results on Yolo v3 with nested objects?
@AlexeyAB I am sticking with two separate models as of now: 2 separate YOLOv2 models, 1 for the phone and another for the damage. I tried training nested objects, that is, a combined model with 2 classes, but the results were poor, be it YOLOv2, YOLOv3, or RetinaNet. Mostly the results get bad when the size of the damage object increases compared to the size of the phone, which seems like a problem of occlusion.
@pkhigh
Mostly the results get bad when the size of the damage object increases compared to the size of the phone, which seems like a problem of occlusion.
Can you show a pair of such examples with bad detection using Yolov3?
I have deleted the trained v3 models. I will re-train and post a few results, with the cfg file as well.
@pkhigh I am facing a similar problem so I am interested in this problem. Is there any update you can share here? What was your solution at the end?
I am working on a problem where I have to detect an object and then detect sub-parts of that object. For example: finding the face in an image and then finding the eyes in the face. When I train 2 different models, 1 for detecting the face only and the other for detecting eyes in the face, I get a detection confidence of 0.9+ in all cases. But when I try to detect both face and eyes in the image with a single model, I cannot get a confidence score greater than 0.8 on various images. I believe the problem is due to the fact that all the major object detection algorithms treat each object as an independent entity. Therefore the model assumes that the eyes are creating occlusion on the face, and that is why the confidence is lower. Is there a way I can train a model which learns the inherent relations between the presence of different objects and sub-objects?