cocodataset / cocoapi

COCO API - Dataset @ http://cocodataset.org/

F1 score calculation #572

Open andreaceruti opened 2 years ago

andreaceruti commented 2 years ago

Has someone managed to find the right way to calculate the F1 score for different tasks?

shivamsnaik commented 2 years ago

Hi, I have written a custom script for calculating these metrics.

You can head to the below link to fetch the code files needed to calculate them:

https://gist.github.com/shivamsnaik/c5c5e99c00819d2167317b1e56871187

andreaceruti commented 2 years ago

@shivamsnaik I think what you are calculating is not the standard definition of precision, recall, and F1-score. For precision defined as TP / (TP + FP) and recall as TP / (TP + FN), I think we have to calculate the TP, FP, and FN for each image and then average over the dataset. The problem is that the API is not very user-friendly, so it is a mess to extract these values.
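
For reference, a minimal sketch of the per-image (macro-averaged) definition described above, assuming the per-image TP/FP/FN counts have already been extracted; the helper below is illustrative and not part of pycocotools:

# Sketch: macro-average precision/recall/F1 over a dataset,
# given per_image_counts as a list of (tp, fp, fn) tuples.
def macro_prf1(per_image_counts, eps=1e-11):
    per_image_scores = []
    for tp, fp, fn in per_image_counts:
        precision = tp / max(tp + fp, eps)
        recall = tp / max(tp + fn, eps)
        f1 = 2 * precision * recall / max(precision + recall, eps)
        per_image_scores.append((precision, recall, f1))
    n = max(len(per_image_scores), 1)
    return tuple(sum(s[i] for s in per_image_scores) / n for i in range(3))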

shivamsnaik commented 2 years ago

The TP, FP, etc. are being calculated per image, per iou threshold in the COCOeval.accumulate() function.

But COCO has not provided any code to finally combine all the information and reduce the metrics to a single value.

I just take the precision, recall, and scores received from COCOeval.accumulate() and compute the final metric.
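
For context, a rough sketch of what that reduction can look like, assuming a coco_eval object on which accumulate() has already been called; the index choices (IoU=0.5, area range "all", largest maxDets) are illustrative:

import numpy as np

# coco_eval.eval['precision'] has shape [T, R, K, A, M]:
#   T = IoU thresholds, R = recall thresholds, K = categories,
#   A = area ranges, M = maxDets settings.
# coco_eval.eval['recall'] has shape [T, K, A, M].
t = 0          # IoU = 0.50 (first of the default [.5:.05:.95] thresholds)
a, m = 0, -1   # area range "all", largest maxDets (100 by default)

precision = coco_eval.eval['precision'][t, :, :, a, m]
precision = np.mean(precision[precision > -1])   # -1 marks empty categories
recall = coco_eval.eval['recall'][t, :, a, m]
recall = np.mean(recall[recall > -1])
f1 = 2 * precision * recall / (precision + recall + 1e-11)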

andreaceruti commented 2 years ago

@shivamsnaik Yes, I have seen your code, but I do not think that taking the precision, averaging it over the recall values ( precesion_iou = precision[iou_lookup[iou], :, :, 0, -1].mean(1) ), and then taking the mean of those values is what I am looking for.

This is my problem: on the left I have the ground truth annotations of an image, and on the right what my model infers. To calculate the precision and recall for my image, this is the code I use:

# Select one image
imgIds = [147]
coco_eval = COCOeval(coco_gt, coco_dt, "segm")
coco_eval.params.imgIds = imgIds
coco_eval.evaluate()

# Now use the dicts produced by evaluate() to calculate everything.
# Here I simply select the dict that corresponds to 'aRng': [0, 10000000000.0]
image_evaluation_dict = coco_eval.evalImgs[0]

# Select the index related to IoU = 0.5
iou_treshold_index = 0

# All the detections from the model; a numpy array of True/False (in my case they are all False)
detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

# Here we keep the detections that we cannot ignore (we apply the not operator to every element of the array)
mask = ~detection_ignore

# Number of ignored detections
n_ignored = detection_ignore.sum()

# And finally we calculate tp, fp and the total number of positives
tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

recall = tp / n_gt
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

Do you think this is good or do you think that something is wrong with my reasoning?

shivamsnaik commented 2 years ago

@andreaceruti Your approach also gives the specific true positive, false positive, etc. values. This is really useful; thanks for sharing it. It has helped me better understand the COCOeval.evaluate() results and the purpose of their fields.

andreaceruti commented 2 years ago

@shivamsnaik thanks for your words, they are really appreciated. I just adapted what I found in a GitHub repository, so I can't take credit for this computation. But since calculating these values is a common problem, I hope other people will also jump into the discussion and check whether there is something wrong.

hafiz031 commented 9 months ago

I have modified @andreaceruti's code a bit to work on the complete test set. I also have some observations, including on the library implementation (correct me if they are wrong).

Anyway, the following is my updated code:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd

ground_truth_annotations = [
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [200, 200, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 2, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 3, 'image_id': 2, 'category_id': 1, 'bbox': [150, 130, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000},
    {'id': 5, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 4, 'image_id': 4, 'category_id': 1, 'bbox': [80, 90, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000}
]

predicted_annotations = [
    {'id': 5, 'image_id': 0, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 1, 'area': 10000},
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 1, 'area': 10000},
    {'id': 4, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 0.1, 'area': 10000},
    {'id': 3, 'image_id': 4, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 0.9, 'area': 10000}
]

image_info_gt = [
                {'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
                {'id': 2, 'file_name': '2.jpg', 'height': 435, 'width': 450},
                {'id': 4, 'file_name': '3.jpg', 'height': 375, 'width': 500}
            ]

# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()

coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()

# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')

# Set parameters
coco_eval.params.areaRng = [[0 ** 2, 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)

pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)

total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case

# Select the index related to IoU = 0.5
iou_treshold_index = 0 # iouThrs - [.5:.05:.95] T=10 IoU thresholds for evaluation, so, index 0 is threshold=0.5

for image_evaluation_dict in coco_eval.evalImgs:
    # evalImgs contains None for image/category/area-range combinations with no
    # ground truth and no detections; skip those entries.
    if image_evaluation_dict is None:
        continue

    print("-" * 30)

    # All the detections from the model, it is a numpy of True/False
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

    # Here we consider the detection that we can not ignore (we use the not operator on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")

    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")

    # And finally we calculate tp, fp and the total positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)

    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")

    total_tp += tp
    total_fp += fp
    total_gt += n_gt

precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)

average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]

print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")

############################### CROSS VALIDATION ############################
import numpy as np
import cv2

for img in image_info_gt:
    dummy = np.ones((img["height"], img["width"], 3), dtype=np.uint8) * 255
    print(img)
    for gt in ground_truth_annotations:
        if gt["image_id"] == img["id"]:
            overlay = dummy.copy()
            x, y, w, h = gt["bbox"]
            cv2.rectangle(overlay, (x, y), (x + w, y + h),
                          (0, 255, 0), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    for pr in predicted_annotations:
        if pr["image_id"] == img["id"]:
            overlay = dummy.copy()
            x, y, w, h = pr["bbox"]
            cv2.rectangle(overlay, (x, y), (x + w, y + h),
                          (0, 0, 255), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    cv2.imshow("image", dummy)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
#############################################################################

This seems to be working for me. Feel free to provide any corrections.
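
An editorial caveat on the script above (not from the original comment): COCOeval never applies a score threshold, it only sorts detections by score and keeps at most maxDets per image, so the TP/FP loop counts every detection regardless of confidence. If you want precision/recall/F1 at a fixed operating point, one option is to filter the detections by a (hypothetical) score_threshold before building the detection COCO object:

score_threshold = 0.5  # illustrative operating point; tune for your model
filtered_predictions = [p for p in predicted_annotations if p["score"] >= score_threshold]

coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt,
                   'annotations': filtered_predictions,
                   'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()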

mxw20010804 commented 3 months ago

@hafiz031 @andreaceruti I'm sorry to bother you. I tried your code and I got the following result:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.553
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.398
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.594
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.625
==============================
PRECISION: 0.066057091882248
RECALL: 0.8935682394111258
F1_SCORE: 0.1230199274007991
AVERAGE_PRECISION: 0.3617328617872838
AVERAGE_RECALL: 0.5938189911568441

iou_treshold_index is set to 0, so the IoU is 0.5. Why are my PRECISION = 0.066 and RECALL = 0.89, while coco_eval says the precision is 0.553 at IoU=0.5?

hafiz031 commented 2 months ago

@mxw20010804 you might be confusing precision with average precision. Precision (not average precision), recall, and F-score are not reported by the library; I added custom code to calculate them. Learn more about Precision, Recall and Mean Average Precision.
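
A toy illustration of why these two numbers can differ so much (an editorial aside, with made-up counts): AP@0.5 is computed on score-ranked detections, so low-confidence false positives mostly hurt the tail of the precision-recall curve, while the unthresholded point precision counts every detection equally.

# Made-up numbers: 10 GT boxes, 9 of them matched by high-score detections,
# plus 91 low-score false positives.
tp, fp, n_gt = 9, 91, 10
point_precision = tp / (tp + fp)   # 0.09 -> looks terrible
point_recall = tp / n_gt           # 0.9
# AP@0.5 ranks detections by score first: if the 9 true positives have the
# highest scores, precision is 1.0 up to recall 0.9 and AP@0.5 comes out
# around 0.9, even though the unthresholded precision is only 0.09.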

soeb-hussain commented 2 months ago

Hi @hafiz031,

Thanks for sharing the code, but it doesn't seem to be working for me. I am running the script below as a sanity test, and I am getting the following output:

PRECISION: 0.0
RECALL: 0.0
F1_SCORE: 0.0
AVERAGE_PRECISION: 0.0
AVERAGE_RECALL: 0.0

I was expecting precision and recall to be 1; let me know your thoughts.

Cheers

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd

ground_truth_annotations = [
    {'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [100, 100, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
]

predicted_annotations = [
    {'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [101, 101, 100, 100], 'score': 1, 'area': 10000},
]

image_info_gt = [
                {'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
            ]

# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()

coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()

# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')

# Set parameters
coco_eval.params.areaRng = [[0 , 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)

# pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)

total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case

# Select the IoU threshold index
iou_treshold_index = 3 # iouThrs = [.5:.05:.95], T=10 IoU thresholds; index 3 is threshold=0.65 (index 0 would be 0.5)

for image_evaluation_dict in coco_eval.evalImgs:
    print("-" * 30)

    # All the detections from the model, it is a numpy of True/False
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

    # Here we consider the detection that we can not ignore (we use the not operator on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")

    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")

    # And finally we calculate tp, fp and the total positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)

    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")

    total_tp += tp
    total_fp += fp
    total_gt += n_gt

precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)

average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]

print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")

hafiz031 commented 2 months ago

@soeb-hussain this library has quite a few problems, and unfortunately I could not find any better alternative. Change the id value in both ground_truth_annotations and predicted_annotations to anything other than 0, and then it works. An id of 0 causes a problem when there is only one annotation; from what I have found, if you have more than one item there is apparently no problem with using 0. So it is better to avoid id 0 here altogether to avoid confusion. I also advise you to do more unit tests like this before using this library, to check whether it gives the results you expect.
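
For what it is worth, a likely explanation (an observation about pycocotools internals, so treat it as an assumption): COCOeval records a match by writing the matched annotation's id into its dtMatches/gtMatches arrays, where 0 means "no match", so an annotation with id 0 is indistinguishable from an unmatched box. A simple defensive fix is to renumber the ids to start from 1 before building the COCO objects:

# Sketch: renumber annotation ids so that none of them is 0.
def renumber_ids(annotations, start=1):
    for new_id, ann in enumerate(annotations, start=start):
        ann["id"] = new_id
    return annotations

ground_truth_annotations = renumber_ids(ground_truth_annotations)
predicted_annotations = renumber_ids(predicted_annotations)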

ksv87 commented 2 months ago

Hi, thanks for the code! Maybe I am missing something, but the results do not seem to make sense to me:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.517
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.873
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.538
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.225
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.322
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.195
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.526
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.392
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.663

but

PRECISION: 0.4527624309392265
RECALL: 0.9015401540154016
F1_SCORE: 0.6027951452739978
AVERAGE_PRECISION: 0.5172575619300731
AVERAGE_RECALL: 0.5613861386138613

I have one class; how can PRECISION (at IoU threshold 0.5) be lower than the Average Precision at IoU=0.5?

hafiz031 commented 1 month ago

Hi @ksv87, can you share a small input example that reproduces this?