easton-cau / SOTR

SOTR: Segmenting Objects with Transformers
MIT License

Inconsistency of test-dev result #9

Closed tjqansthd closed 2 years ago

tjqansthd commented 2 years ago

Hi, thanks for your great work!

I tested your pre-trained model (R-101 3x) on test-dev2017 via the COCO evaluation server. When I extract only the mask results and save them to JSON files, the segmentation score matches the results reported in the paper. However, when I save the mask results together with the box results (generated by lines 511~514 in sotr.py) to JSON files, the scores (AP_s, AP_m, AP_l) differ from the paper.
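For clarity, this is the shape of the two result files I am comparing. The field names follow the standard COCO results format; the image id, category, score, box values, and the RLE counts string are made-up placeholders, not actual outputs:

```python
import json

# One detection saved with mask information only:
mask_only = {
    "image_id": 1,
    "category_id": 18,
    "segmentation": {"size": [480, 640], "counts": "..."},  # RLE placeholder
    "score": 0.92,
}

# The same detection with the box results added alongside the mask:
mask_with_box = dict(mask_only, bbox=[10.0, 20.0, 50.0, 40.0])  # [x, y, w, h]

# The two JSON files submitted to the evaluation server:
json.dumps([mask_only])       # masks only
json.dumps([mask_with_box])   # masks + boxes
```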

This is the result from the JSON file that consists of only mask information:

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.402
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.612
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.434
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.102
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.590
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.731
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.328
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.512
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.536
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.590
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.733

This is the result from the JSON file that consists of mask information together with box information:

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.402
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.612
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.434
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.194
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.440
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.328
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.512
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.536
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.590
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.733

As can be seen, (AP_s, AP_m, AP_l) of the first and second results are (0.102, 0.590, 0.731) and (0.194, 0.440, 0.552), respectively. I don't know the detailed process of the COCO evaluation server, so I wonder why the segmentation AP differs depending on the presence or absence of box information.

QuLiao1117 commented 2 years ago

Thanks for your attention. We did not perform an in-depth investigation of this issue; maybe you can find the reason in https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L163.

tjqansthd commented 2 years ago

The code seems to calculate IoU by handling the box case and the segmentation case independently. And the overall AP values are all the same; only the area-stratified APs differ... I haven't found out exactly which part is the problem.
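One relevant detail: AP_s, AP_m, and AP_l are computed by bucketing each annotation by its 'area' field. A sketch of the default area ranges (as set in `Params.setDetParams` in pycocotools' cocoeval.py):

```python
# Default COCOeval area buckets, in squared pixels
# (values taken from pycocotools' cocoeval.py Params defaults).
area_rng = {
    "all":    [0 ** 2, 1e5 ** 2],
    "small":  [0 ** 2, 32 ** 2],    # area < 1024 px
    "medium": [32 ** 2, 96 ** 2],   # 1024 px <= area < 9216 px
    "large":  [96 ** 2, 1e5 ** 2],  # area >= 9216 px
}
```

So anything that changes how 'area' is computed would move detections between these buckets without affecting the all-areas AP.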

easton-cau commented 2 years ago

For more detailed testing questions, please contact the COCO contributors.

tjqansthd commented 2 years ago

In my opinion, the 'loadRes' function at line 305 of https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/coco.py causes the difference in the area-based results. As can be seen at lines 331 and 341, ann['area'] is calculated differently depending on whether there is box information (ann['area'] = bb[2]*bb[3] when there is box information, and ann['area'] = maskUtils.area(ann['segmentation']) when there isn't).
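A minimal sketch (plain Python, not the actual cocoapi code) of how the two area computations can disagree for the same object, using a made-up thin diagonal mask:

```python
# A thin diagonal object on a 4x4 binary grid:
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# With box information: ann['area'] = bb[2] * bb[3] (box width * height).
bb = [0, 0, 4, 4]                       # tight [x, y, w, h] around the object
area_from_box = bb[2] * bb[3]           # 16

# Without box information: ann['area'] is the foreground pixel count,
# which is what maskUtils.area returns for the RLE-encoded mask.
area_from_mask = sum(sum(row) for row in mask)  # 4

assert area_from_box != area_from_mask
```

For elongated or non-convex objects the box area is much larger than the mask area, so the same detections get assigned to different small/medium/large buckets, which matches the observation that only AP_s, AP_m, AP_l change.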

When I run the evaluation of SOLOv2_R-101_3x from AdelaiDet on test-dev, it shows a similar trend: (AP_s, AP_m, AP_l) is (0.095, 0.574, 0.714) when there is box information, and (0.178, 0.427, 0.546, similar to the result in the SOLOv2 paper) when there isn't.

QuLiao1117 commented 2 years ago

Thanks very much for your issue. By reading the code, we found that these two calculation methods do lead to differences in the area calculation. We have added a note in the README to explain this problem.
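For anyone who wants the paper-style area-stratified numbers, one possible workaround (my own sketch, not something the repo ships) is to drop the "bbox" field before writing the results JSON, so that loadRes falls back to maskUtils.area(ann['segmentation']). The result entries below are placeholders in the standard COCO results format:

```python
# Hypothetical post-processing step: strip boxes from an in-memory
# results list before dumping it to JSON for the evaluation server.
results = [
    {"image_id": 1, "category_id": 1, "score": 0.9,
     "bbox": [0.0, 0.0, 4.0, 4.0],
     "segmentation": {"size": [4, 4], "counts": "..."}},  # RLE placeholder
]

for r in results:
    r.pop("bbox", None)  # loadRes will then compute area from the mask
```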