Nuwan1654 opened 1 year ago
Did you find a solution? I've trained a single-class model for face detection, and after testing it shows these YOLO metrics:
Class Images Labels P R mAP@.5 mAP@.5:.95:
all 1218 23101 0.867 0.703 0.744 0.368
but the COCO metrics, computed with pycocotools against instances_val2017.json and annotations generated in COCO format, show totally different values:
Accumulating evaluation results...
DONE (t=0.32s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.312
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.620
...
No, I haven't found a solution yet.
I think it's because test.py loads COCO's ground truth by default: https://github.com/WongKinYiu/yolov7/blob/main/test.py#L257
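For anyone stuck on that line, below is a minimal, standalone sketch of running the pycocotools evaluation against your own COCO-format ground truth instead of the hard-coded instances_val2017.json. The file paths are placeholders and this is not the exact code in test.py, just the same pycocotools calls pointed at custom files:

```python
# Sketch: evaluate a saved predictions JSON against a custom COCO-format
# ground-truth file instead of the default instances_val2017.json.
# Paths below are placeholders for your own data.
import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

anno_json = 'path/to/custom_annotations.json'  # COCO-format ground truth
pred_json = 'best_predictions.json'            # detections saved with --save-json

coco_gt = COCO(anno_json)            # load ground truth
coco_dt = coco_gt.loadRes(pred_json) # load detections; image_id/category_id must match the GT

# Evaluate only the images that actually have predictions, otherwise every
# image missing from the predictions file drags recall down to zero.
with open(pred_json) as f:
    pred_image_ids = sorted({d['image_id'] for d in json.load(f)})

coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.params.imgIds = pred_image_ids
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
```

If the image ids or category ids in the two JSON files don't line up, this is exactly where the numbers silently fall apart, which seems to be what's happening in the results below.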
I have the same problem. I fixed test.py to work with custom data in COCO format, but the evaluation using pycocotools looks very broken.
When I ran test.py with the --save-json argument, it returned both the AP from yolov7's own calculation and pycocotools' evaluation, but the two are very different; the results are below. I also edited test.py to write integer image ids into best_predictions.json, taken from the corresponding images in the ground-truth annotation JSON file, so there are no errors like pycocotools unable to run: Results do not correspond to current coco set.
yolov7 mAP calculations

Class               Images   Labels      P      R   mAP@.5   mAP@.5:.95
all                   4755   118982  0.621  0.652    0.66        0.514
laparotomy_sponge     4755    24410  0.831  0.702    0.785       0.575
needle_driver         4755    13003  0.473  0.328    0.457       0.354
needle_packaging      4755    13960  0.504  0.858    0.843       0.736
suture_needle         4755    29388  0      0        0.0104      0.00276
sharp_disposal_box    4755     4594  0.998  0.99     0.995       0.918
scalpel_handle        4755     4424  0      0        0.115       0.0294
scissor               4755     8644  0.517  0.912    0.535       0.414
face                  4755     4697  0.932  0.979    0.975       0.533
name_tag              4755     4395  0.991  0.803    0.907       0.686
person                4755    11467  0.965  0.949    0.978       0.892
coco evaluation

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.008, per category = [ 0.008 0.027 0.031 0.000 0.000 0.004 0.000 0.000 0.005 0.000]
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.012, per category = [ 0.014 0.039 0.045 0.000 0.000 0.018 0.000 0.000 0.006 0.000]
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.008, per category = [ 0.008 0.029 0.033 0.000 0.000 0.000 0.000 0.000 0.006 0.000]
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000, per category = [ 0.000 0.000 0.000 0.000 -1.000 0.000 0.000 -1.000 -1.000 -1.000]
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.006, per category = [ 0.001 0.000 0.017 0.000 0.000 0.033 0.000 -1.000 0.000 0.000]
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.058, per category = [ 0.011 0.061 0.507 0.000 0.000 0.000 0.000 0.000 0.005 0.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000, per category = [ 0.000 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.040, per category = [ 0.015 0.037 0.165 0.000 0.000 0.101 0.004 0.033 0.047 0.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.177, per category = [ 0.178 0.504 0.620 0.000 0.000 0.113 0.004 0.035 0.312 0.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000, per category = [ 0.000 0.000 0.000 0.000 -1.000 0.000 0.000 -1.000 -1.000 -1.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.100, per category = [ 0.088 0.141 0.534 0.000 0.000 0.116 0.000 -1.000 0.020 0.000]
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.194, per category = [ 0.182 0.535 0.777 0.000 0.000 0.088 0.004 0.035 0.315 0.000]
I am mainly interested in AP@.5:.95.
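In case it helps, here is a rough post-hoc sketch of the same image-id fix, done on the saved JSON rather than inside test.py: remap the string ids yolov7 writes into best_predictions.json to the integer ids used by the ground-truth JSON. It assumes the ground truth's file_name stems match the ids yolov7 wrote, and the file names are placeholders:

```python
# Sketch: remap image_id values in best_predictions.json to the integer
# ids from a COCO-format ground-truth file. Assumes the ground truth's
# "file_name" stems match yolov7's string image ids. Paths are placeholders.
import json
from pathlib import Path

with open('path/to/custom_annotations.json') as f:
    gt = json.load(f)
with open('best_predictions.json') as f:
    preds = json.load(f)

# Map file-name stem -> integer image id from the ground truth.
stem_to_id = {Path(img['file_name']).stem: img['id'] for img in gt['images']}

for det in preds:
    # str() in case yolov7 already converted a numeric stem to int.
    det['image_id'] = stem_to_id[str(det['image_id'])]

with open('best_predictions_fixed.json', 'w') as f:
    json.dump(preds, f)
```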
I have the same problem. In my case, the category_id in the best_predictions.json saved by yolov7 starts from 1, but the category_id in my own annotations starts from 0, so the categories don't match. After subtracting 1 from every category_id in best_predictions.json, the results look normal, but they still differ from yolo's own calculation by about 1 mAP.
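For reference, a tiny sketch of that offset fix (subtracting 1 from every category_id in best_predictions.json so it matches 0-based annotations); the file names are placeholders:

```python
# Sketch: shift yolov7's 1-based category ids to the 0-based ids used by
# these custom annotations. Paths are placeholders.
import json

with open('best_predictions.json') as f:
    preds = json.load(f)

for det in preds:
    det['category_id'] -= 1  # align 1-based predictions with 0-based ground truth

with open('best_predictions_zero_based.json', 'w') as f:
    json.dump(preds, f)
```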