cocodataset / cocoapi

COCO API - Dataset @ http://cocodataset.org/

cocoeval GT-DT matching implementation is wrong #564

Open · bertsky opened 2 years ago

bertsky commented 2 years ago

I don't know how this could go undetected for so long (I cannot find any mention of it in the issues), but I am certain there is a bug in the algorithm used to match DT (detection) and GT (ground-truth) objects. It gets more severe at lower iouThr values, but pathological cases exist even at high degrees of overlap. Both the Python and the Matlab implementations are affected.

The basic idea of the evaluateImg(imgId, catId) algorithm is to (for each iouThr value independently):

1. iterate over DT objects in order of decreasing confidence score, assigning each at most one GT object;
2. for each DT, iterate over GT objects (the inner loop), matching the GT with the largest mutual IoU, provided it exceeds iouThr.

Now, the outer-inner loop structure already dictates an asymmetry: if there are multiple potential matches, we obviously cannot get the best DT match per GT object – only the best GT match per DT object. That is already a problem in itself (e.g. 2 DTs could "share" 1 GT, but the GT would have to "choose"; ideally the data structure and the algorithm for filling it should be rich enough to map multiple matches completely and symmetrically). But one could live with that asymmetry (and, if need be for the other preference, just call COCOeval in the reverse direction).

However, it gets worse by an additional criterion that seemingly addresses this – but in the wrong way:

https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L279-L281
https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/MatlabAPI/CocoEval.m#L383-L384
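For context, here is a condensed paraphrase of the matching loop surrounding those lines in the Python implementation (crowd/ignore handling omitted for brevity); the criticized criterion is the first continue:

```python
# condensed paraphrase of the matching loop in evaluateImg
# (single IoU threshold t; dt sorted by decreasing score)
for dind, d in enumerate(dt):
    iou = min([t, 1 - 1e-10])      # best IoU seen so far for this DT
    m = -1                         # index of best GT candidate so far
    for gind, g in enumerate(gt):
        # the criticized criterion: if this gt is already matched,
        # skip it unconditionally (no matter how good this pair is)
        if gtm[tind, gind] > 0 and not iscrowd[gind]:
            continue
        # continue to next gt unless a better match is made
        if ious[dind, gind] < iou:
            continue
        iou = ious[dind, gind]
        m = gind
    if m == -1:
        continue
    # store the match for both sides; this gt is now "reserved"
    dtm[tind, dind] = gt[m]['id']
    gtm[tind, m] = d['id']
```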

That effectively ruins even the other criterion (of selecting the best GT match per DT object), replacing it with an arbitrary choice (of whichever DT came first to "reserve" the GT). We end up with up to 1 GT per DT (the "best free" one) and up to 1 DT per GT (the "first best" one). It is only "up to" 1 because we can now miss the only match for a pair (even if it is actually better IoU-wise) when its GT is already reserved by some earlier (i.e. higher-scoring) DT.

I wonder what motivation the authors had for that additional criterion. Could it be speed (i.e. shortcutting the combinatorial explosion of comparisons)? Or was it consistency (i.e. ensuring choices in gtMatches always mirror those in dtMatches)?

IMO one should modify that criterion to at least compare the IoU of the GT object's previous DT match with the current pair's IoU: iff the former is actually larger than the latter, keep it; otherwise drop it (i.e. reassign the GT) or add it (i.e. allow dtMatches to contain the same GT id multiple times, while gtMatches keeps only the best DT id). See the sketch below.
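A minimal sketch of that relaxed criterion (hypothetical code, not part of pycocotools; the names match_greedy_rematch and gtm_iou are made up). It implements the "add it" variant, so dtm may contain the same GT index more than once:

```python
import numpy as np

def match_greedy_rematch(ious, iou_thr):
    """Greedy matching where a later DT may take over a GT that is already
    matched, provided the new pair's IoU is strictly higher. Rows of `ious`
    are DTs (assumed sorted by decreasing confidence), columns are GTs."""
    n_dt, n_gt = ious.shape
    dtm = np.full(n_dt, -1)      # GT index matched to each DT (may repeat)
    gtm = np.full(n_gt, -1)      # DT index currently owning each GT
    gtm_iou = np.zeros(n_gt)     # IoU of each GT's current match
    for dind in range(n_dt):
        best_iou, m = iou_thr, -1
        for gind in range(n_gt):
            # relaxed criterion: skip an already-matched GT only if its
            # existing match is at least as good as the current pair
            if gtm[gind] >= 0 and gtm_iou[gind] >= ious[dind, gind]:
                continue
            if ious[dind, gind] < best_iou:
                continue
            best_iou, m = ious[dind, gind], gind
        if m == -1:
            continue
        dtm[dind] = m            # "add it": earlier DTs keep their entry
        gtm[m] = dind            # reassign: GT now points at the new DT
        gtm_iou[m] = best_iou
    return dtm, gtm
```

The "drop it" variant would additionally reset the previous owner's entry, e.g. dtm[gtm[m]] = -1, before reassigning the GT.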

Please correct me if I am getting this all wrong! (Also, share if you have better ideas or different perspectives.)

andreaceruti commented 2 years ago

@bertsky I do not think this repo is still maintained by the authors. Anyway, can you give an implementation of your solution to the problem? I think you are right on this point.

bertsky commented 2 years ago

@andreaceruti, no, I did not bother with pycocotools for that purpose (though I believe it would not be difficult based on the above analysis). I went for my own algorithm here ff. to get truly n:m matches (and FP and FN counts, pixel-wise measures, and over-/undersegmentation measures).

andreaceruti commented 2 years ago

@bertsky Wow, really nice work! If I wanted to evaluate my custom dataset on instance segmentation and object detection tasks using your implementation, could I just use the evaluate_coco function? Or do you think I would need some other changes? Anyway, I will dig deeper into your code over the coming days, since it is documented better than this repo :)

bertsky commented 2 years ago

@andreaceruti Yes, evaluate_coco if you already have COCO datasets – but there is no CLI for that (yet). If you have PAGE-XML, you can use standalone_cli (which gets installed as page-segment-evaluate via entry_points).

andreaceruti commented 2 years ago

@bertsky I am almost there. Yes, I have two JSON files representing coco_gt and coco_dt. I applied your method and then looked at the dict it constructs. The first part summarizes the whole dataset, and then there is one dict per image. For example, I get this dict for the entire dataset:

" 'by-category': {'grape bunch': {'IoDT': 0.8433501965488758, 'IoGT': 0.8414118188031777, 'IoU': 0.7247137091559454, 'oversegmentation': 0.00469693580854395, 'pixel-iou': 0.9503520788133711, 'pixel-precision': 0.9917968516288882, 'pixel-recall': 0.9578813700668292, 'segment-precision': 0.9752851711026616, 'segment-recall': 0.8496732026143791, 'undersegmentation': 0.007393324883819181}}"

And this for one sample image:

```
'CDY_2043.jpg': {'oversegmentation': {'grape bunch': 0.018181818181818184},
                 'pixel_iou': {'grape bunch': 1.0},
                 'pixel_precision': {'grape bunch': 1.0},
                 'pixel_recall': {'grape bunch': 1.0},
                 'precision': {'grape bunch': 1.0},
                 'recall': {'grape bunch': 1.0},
                 'true_positives': {'grape bunch': [
                     {'DT.ID': 344, 'DT.area': 37258,
                      'GT.ID': 2031, 'GT.area': 47830, 'I.area': 36497,
                      'IoDT': 0.9795748564066777,
                      'IoGT': 0.7630566590006272,
                      'IoU': 0.7511061719248421},
                     ....
                     {'DT.ID': 354, 'DT.area': 4067,
                      'GT.ID': 2036, 'GT.area': 11346, 'I.area': 3794,
                      'IoDT': 0.9328743545611016,
                      'IoGT': 0.33439097479287855,
                      'IoU': 0.32653412513985713}]},
                 'undersegmentation': {'grape bunch': 0.0}}
```

If I understand correctly, the idea behind these metrics is taken from the "Rethinking Semantic Segmentation Evaluation" paper, but could you explain to me how I could obtain AP, TPs, FPs, and FNs for the instance segmentation task?

andreaceruti commented 2 years ago

@bertsky This is the image I have used as an example. On the left you can see the ground truth, and on the right the detections. [image attachment]

bertsky commented 2 years ago

@andreaceruti this is off-topic here and too verbose – let's discuss it in ocrd_segment

volcanolee4 commented 2 years ago

@bertsky Hi! I don't think it is an arbitrary choice in the cocoeval.evaluate() function (as you say, the arbitrary choice means "whichever DT came first to reserve the GT"). Before a DT finds its GT, the list of DTs has already been sorted by class confidence score, so if a DT can "come first", that means this DT has a higher class confidence score. In conclusion, if a GT meets more than one DT, the DT with the highest class confidence score matches this GT.

bertsky commented 2 years ago

@volcanolee4 I did not say the choice was random, though. (In fact, I alluded to the candidates being sorted by confidence with my formulation "GT is already reserved by some earlier – i.e. higher-scoring – DT".)

But (given the actual problem to solve here – how to adequately compare a given prediction with the true segmentation) I insist that the choice is arbitrary: any DT candidate with just slightly higher confidence but much lower relative or absolute overlap can displace the natural best fit for a GT region. (And that assignment then not only becomes exclusive for the GT, it can also prevent the better DT from matching at all.) It is one thing to factor in the confidence of a predictor, but quite another to properly assess its accuracy. The current implementation not only conflates the two, it makes both impossible.
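To illustrate with a toy sketch (made-up numbers, not pycocotools code): one GT region and two DTs sorted by descending score, where the higher-scoring DT has a much worse fit.

```python
import numpy as np

# One GT, two DTs sorted by decreasing confidence:
# DT0 (score 0.9) overlaps the GT with IoU 0.55,
# DT1 (score 0.8) overlaps the same GT with IoU 0.95.
ious = np.array([[0.55],
                 [0.95]])
iou_thr = 0.5

gt_taken = [False]
matches = [-1, -1]
for dind in range(ious.shape[0]):
    best, m = iou_thr, -1
    for gind in range(ious.shape[1]):
        if gt_taken[gind]:              # the criticized criterion
            continue
        if ious[dind, gind] >= best:
            best, m = ious[dind, gind], gind
    if m >= 0:
        gt_taken[m] = True
        matches[dind] = m

print(matches)  # [0, -1]: DT0 reserves the GT despite the worse fit,
                # and the natural best match DT1 counts as a false positive
```

Under the relaxed criterion sketched earlier, DT1 would take over the GT, and DT0's entry would be either dropped or kept depending on the variant.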