cocodataset / cocoapi

COCO API - Dataset @ http://cocodataset.org/

AP is not invariant to shuffling the order of detections #650

Open deanmark opened 1 year ago

deanmark commented 1 year ago

I'm running the example in pycocoEvalDemo.ipynb. If I shuffle the order of the detections, then for certain shuffles I get different AP results.

Shuffling:

import json
import random

# Load the demo detections, shuffle their order, and write them back out.
with open(resFile) as f:
    anns = json.load(f)
random.shuffle(anns)
resFile2 = resFile.replace('results.json', 'results2.json')
with open(resFile2, 'w') as f:
    json.dump(anns, f, separators=(',', ':'))

Now evaluate using the shuffled file by replacing cocoDt = cocoGt.loadRes(resFile) with cocoDt = cocoGt.loadRes(resFile2).
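
For completeness, the evaluation steps from the demo look roughly like this (a sketch assuming the bbox task and the demo's annFile/resFile variables):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

cocoGt = COCO(annFile)             # ground-truth annotations
cocoDt = cocoGt.loadRes(resFile2)  # shuffled detections
cocoEval = COCOeval(cocoGt, cocoDt, 'bbox')
cocoEval.params.imgIds = sorted(cocoGt.getImgIds())
cocoEval.evaluate()
cocoEval.accumulate()
cocoEval.summarize()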

With the original detections file, I get the following results:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.50458
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.69697
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.57298
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.58563
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.51940
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.50140
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.38681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.59368
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.59535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.63981
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.56642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.56429

And after shuffling, I get:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.50458
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.69786
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.57293
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.58564
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.51940
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.50140
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.38600
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.59389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.59557
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.64012
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.56642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.56429

Notice that AP@50 changes from 0.69697 to 0.69786. The detections are identical; only their order differs, yet the results change.
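
To see why input order can matter at all, here is a minimal self-contained sketch (my own toy code, not pycocotools) of 101-point interpolated AP for two detections with identical scores, one a true positive and one a false positive; swapping their order changes the AP:

import numpy as np

def average_precision(is_tp, n_gt):
    # 101-point interpolated AP (as COCO uses) over detections taken
    # in the given order, i.e. after the score sort has been applied.
    is_tp = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # make the precision envelope monotonically non-increasing
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])
    rec_thrs = np.linspace(0.0, 1.0, 101)
    inds = np.searchsorted(recall, rec_thrs, side='left')
    q = np.zeros(101)
    for ri, pi in enumerate(inds):
        if pi < len(precision):
            q[ri] = precision[pi]
    return q.mean()

# One ground-truth box, two detections with the same score:
# a matched (TP) one and an unmatched (FP) one. Only the order differs.
print(average_precision([True, False], n_gt=1))  # 1.0
print(average_precision([False, True], n_gt=1))  # 0.5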

deanmark commented 1 year ago

After some analysis, the bug in the AP calculation seems to arise in the accumulate function: detections are ordered by dtScores at line 366, inds = np.argsort(-dtScores, kind='mergesort').

The problem happens when several detections have exactly the same score but different dtMatches values. After the sort, their relative order is determined by the order in which they appear in the original detections file. So if equally-scored detections have different dtMatches values (some are matched, some are not), the final AP calculation depends on that input order.

One way to solve the problem is to sort by dtScores and use dtMatches as a tie-breaker, giving matched detections precedence in the sort. This makes the AP invariant to the input order of detections. Note, however, that fixing the bug changes the current behavior: newly reported scores may differ from the scores some users see today.

A possible fix is to replace lines 362-366 in cocoeval.py with:

dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
# dtMatches has one row per IoU threshold; reduce it to a single
# matched/unmatched flag per detection for the tie-break (here,
# matched at any threshold counts as matched).
dtMatched = np.concatenate([e['dtMatches'][:, 0:maxDet] for e in E], axis=1).any(axis=0)

# np.lexsort sorts by the last key first: the primary key is descending
# score, and on ties matched detections sort before unmatched ones.
# lexsort is stable, like the mergesort used for consistency with the
# Matlab implementation.
inds = np.lexsort((np.logical_not(dtMatched), -dtScores))
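
As a sanity check, a tiny standalone example of the lexsort tie-break (made-up scores, nothing from cocoeval.py):

import numpy as np

scores  = np.array([0.9, 0.9, 0.8])
matched = np.array([False, True, True])

# Last key is the primary sort key: descending score.
# On ties, matched (logical_not == False) sorts before unmatched.
inds = np.lexsort((np.logical_not(matched), -scores))
print(inds)  # [1 0 2]: the matched 0.9 detection now comes first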