ZhangAoCanada / RADDet

Range-Azimuth-Doppler Based Radar Object Detection
MIT License

Mean average precision computation #7

Closed colindecourt closed 2 years ago

colindecourt commented 2 years ago

Hi, while evaluating your model and reading the code, I noticed something odd in the computation of the mean average precision.

...
for class_i in all_gt_classes:
    ### NOTE: get the prediction per class and sort it ###
    pred_class = predictions[predictions[..., 7] == class_i]
    pred_class = pred_class[np.argsort(pred_class[..., 6])[::-1]]
    ### NOTE: get the ground truth per class ###
    gt_class = gts[gts[..., 6] == class_i]
    tp, fp = getTruePositive(pred_class, gt_class, input_size, \
                             iou_threshold=tp_iou_threshold, mode=mode)
...

When iterating over all_gt_classes, only the detections whose class matches one of the ground-truth classes are counted as true positives or false positives. Detections whose label does not appear among the ground-truth classes of a given image are therefore never counted. For example, in a spectrum where the ground truths are 2 cars and 1 pedestrian and the predictions are 2 cars and 1 bicycle, the bicycle detection does not seem to be counted as a false positive with your code. Is there a reason for this choice, or am I wrong in my analysis? Thanks for your response.
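
To make the concern concrete, here is a minimal sketch with hypothetical class ids (0 = car, 1 = pedestrian, 2 = bicycle) and dummy box values, following the column layout visible in the snippet above (prediction score in column 6, prediction class in column 7, ground-truth class in column 6):

import numpy as np

# Hypothetical image: ground truths are 2 cars and 1 pedestrian,
# predictions are 2 cars and 1 bicycle.
gts = np.array([
    [0, 0, 0, 0, 0, 0, 0],   # car
    [0, 0, 0, 0, 0, 0, 0],   # car
    [0, 0, 0, 0, 0, 0, 1],   # pedestrian
], dtype=float)
predictions = np.array([
    [0, 0, 0, 0, 0, 0, 0.9, 0],  # car, score 0.9
    [0, 0, 0, 0, 0, 0, 0.8, 0],  # car, score 0.8
    [0, 0, 0, 0, 0, 0, 0.7, 2],  # bicycle, score 0.7 -> class 2 appears in no ground truth
], dtype=float)

all_gt_classes = np.unique(gts[..., 6])   # only {0, 1}
for class_i in all_gt_classes:
    pred_class = predictions[predictions[..., 7] == class_i]
    print(class_i, len(pred_class))
# Class 2 is never visited, so the bicycle detection is neither a TP nor an FP.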

ZhangAoCanada commented 2 years ago

Hi, the mAP code in this repo is based on Cartucho/mAP. The evaluation process is: first output the predictions from the model on the entire test set, then perform the class-wise AP calculation; the mAP is the average of all the class-wise APs. So the AP reported in the paper is not calculated on just one or a few input samples.

If you prefer to compute AP on a limited number of input samples, you could change all_gt_classes to the full set of classes listed in config.json. Be careful about the situation where 0/0 appears (for example, a class with no ground truths in those samples).
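
A minimal sketch of that change, assuming the class ids are consecutive integers starting at 0 and that num_classes is taken from the class list in config.json (the exact structure of that file is not shown here):

import numpy as np

def classwise_iter(predictions, gts, num_classes):
    """Yield per-class prediction/ground-truth slices for every class id in
    [0, num_classes), skipping classes that have neither predictions nor
    ground truths (the 0/0 case mentioned above)."""
    for class_i in range(num_classes):
        pred_class = predictions[predictions[..., 7] == class_i]
        pred_class = pred_class[np.argsort(pred_class[..., 6])[::-1]]
        gt_class = gts[gts[..., 6] == class_i]
        if len(pred_class) == 0 and len(gt_class) == 0:
            continue
        yield class_i, pred_class, gt_class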

Hope this helps.

colindecourt commented 2 years ago

Hi, I agree that the predictions are computed over the entire test set. To be precise, though, what is calculated is the mean of the per-image class-wise APs, rather than the mAP over the entire dataset.

> The evaluation process is: first output the predictions from the model on the entire test set, then perform the class-wise AP calculation.

In this case, in my opinion, the true positives and false positives should be accumulated over the entire test set and then used for the class-wise AP calculation, as here. This is how the mAP in the Pascal VOC challenge is computed. Also, in Pascal VOC, a detection that doesn't match any ground truth is counted as a false positive, but I don't see where this case is handled in your code.
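
For reference, a rough sketch of the dataset-level, VOC-style accumulation I mean, assuming a per-image matching step (like getTruePositive) has already flagged every prediction of a given class, over the whole test set, as matched (TP) or unmatched (FP):

import numpy as np

def voc_average_precision(scores, tp_flags, num_gt):
    """Pascal-VOC-style AP for one class, computed over the whole test set.

    scores   : confidence of every prediction of this class, across all images
    tp_flags : 1 if that prediction matched a previously unmatched ground truth
               (IoU above the threshold), 0 otherwise -- so unmatched predictions,
               including those whose class appears in no ground truth, count as FP
    num_gt   : total number of ground-truth boxes of this class in the test set
    """
    order = np.argsort(scores)[::-1]
    tp = np.asarray(tp_flags, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # Area under the precision-recall curve with the monotone envelope (VOC2010+ style).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

The mAP would then be the mean of this value over all classes, rather than a mean of per-image APs.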

Thanks for the clarification.