cocodataset / cocoapi

COCO API - Dataset @ http://cocodataset.org/

F1 score calculation #572

Open andreaceruti opened 2 years ago

andreaceruti commented 2 years ago

Has someone managed to find the right way to calculate the F1 score for different tasks?

shivamsnaik commented 2 years ago

Hi, I have written a custom script for calculating these metrics.

You can head to the below link to fetch the code files needed to calculate them:

https://gist.github.com/shivamsnaik/c5c5e99c00819d2167317b1e56871187

andreaceruti commented 2 years ago

@shivamsnaik I think what you are calculating is not the standard definition of precision, recall, and F1-score. For precision defined as TP / (TP + FP) and recall as TP / (TP + FN), I think we have to compute TP, FP, and FN for each image and then average over the dataset. The problem is that the API is not very user-friendly, so extracting these values is a mess.
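
For reference, a minimal sketch (plain Python, independent of pycocotools) of the per-image definitions being discussed; averaging these per-image values over the dataset gives a macro average, while summing TP/FP/FN first gives a micro average:

```python
# Plain-Python sketch of the metrics under discussion (not part of the COCO API).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```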

shivamsnaik commented 2 years ago

The TP, FP, etc. are being calculated per image and per IoU threshold in the COCOeval.accumulate() function.

But COCO has not provided any code to finally combine all the information and reduce the metrics to a single value.

I just take the precision, recall, and scores received from COCOeval.accumulate() and compute the final metric.
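
For reference, here is a rough sketch of one way to reduce the arrays produced by accumulate() to a single F1 value (not necessarily what the linked gist does), assuming a `COCOeval` instance named `coco_eval` on which `evaluate()` and `accumulate()` have already been run:

```python
import numpy as np

# Rough sketch: reduce the accumulate() output to a single F1 value.
# eval["precision"] has shape [T, R, K, A, M]:
# IoU thresholds, recall thresholds, categories, area ranges, max detections.
prec = coco_eval.eval["precision"][0, :, :, 0, -1]  # IoU=0.50, area="all", maxDets=100
rec = coco_eval.params.recThrs                      # the 101 recall thresholds
valid = prec > -1                                   # -1 marks cells with no data
p = prec[valid]
r = np.broadcast_to(rec[:, None], prec.shape)[valid]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
print("best F1 along the PR curve @ IoU=0.50:", f1.max() if f1.size else float("nan"))
```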

andreaceruti commented 2 years ago

@shivamsnaik Yes, I have seen your code, but I do not think that, for example, taking the precision, averaging it over the recall values (precesion_iou = precision[iou_lookup[iou], :, :, 0, -1].mean(1)), and then taking the mean of those values is what I am looking for.

This is my problem. On the left I have the ground-truth annotations of an image, and on the right what my model infers. To calculate the precision and recall for this image, this is the code I use:

# Select one image
imgIds = [147]
coco_eval = COCOeval(coco_gt, coco_dt, "segm")
coco_eval.params.imgIds = imgIds
coco_eval.evaluate()

# Now I will use the dicts produced by evaluate() to calculate everything.
# Here I simply select the dict that corresponds to 'aRng': [0, 10000000000.0]
image_evaluation_dict = coco_eval.evalImgs[0]

# Select the index related to IoU = 0.5
iou_treshold_index = 0

# All the detections from the model; it is a numpy array of True/False
# (in my case they are all False)
detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

# Here we consider the detections that we cannot ignore
# (we apply the not operator to every element of the array)
mask = ~detection_ignore

# Number of ignored detections
n_ignored = detection_ignore.sum()

# And finally we calculate tp, fp and the total positives
tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

recall = tp / n_gt
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

Do you think this is good or do you think that something is wrong with my reasoning?

shivamsnaik commented 2 years ago

@andreaceruti Your approach gives the specific true positive, false positive, etc. values too. This is really useful. Thanks for sharing it with me. It has helped me better understand the COCOeval.evaluate() results and the purpose of their fields.

andreaceruti commented 2 years ago

@shivamsnaik thanks for your words, they are really appreciated. I have just adapted what I found in a GitHub repository, so I can't take credit for this computation. But since calculating these values is a common problem, I hope other people will also jump into the discussion and try to figure out whether something is wrong.

hafiz031 commented 8 months ago

I have modified @andreaceruti's code a bit to work on the complete test set. I also have some observations, including some about the library implementation itself (correct me if they are wrong).

Anyway, the following is my updated code:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd

ground_truth_annotations = [
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [200, 200, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 2, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 3, 'image_id': 2, 'category_id': 1, 'bbox': [150, 130, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000},
    {'id': 5, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
    {'id': 4, 'image_id': 4, 'category_id': 1, 'bbox': [80, 90, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000}
]

predicted_annotations = [
    {'id': 5, 'image_id': 0, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 1, 'area': 10000},
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 1, 'area': 10000},
    {'id': 4, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 0.1, 'area': 10000},
    {'id': 3, 'image_id': 4, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 0.9, 'area': 10000}
]

image_info_gt = [
                {'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
                {'id': 2, 'file_name': '2.jpg', 'height': 435, 'width': 450},
                {'id': 4, 'file_name': '3.jpg', 'height': 375, 'width': 500}
            ]

# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()

coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()

# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')

# Set parameters
coco_eval.params.areaRng = [[0 ** 2, 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)

pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)

total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case

# Select the index related to IoU = 0.5
iou_treshold_index = 0 # iouThrs - [.5:.05:.95] T=10 IoU thresholds for evaluation, so, index 0 is threshold=0.5

for image_evaluation_dict in coco_eval.evalImgs:
    print("-" * 30)

    # All the detections from the model, it is a numpy of True/False
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

    # Here we consider the detection that we can not ignore (we use the not operator on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")

    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")

    # And finally we calculate tp, fp and the total positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)

    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")

    total_tp += tp
    total_fp += fp
    total_gt += n_gt

precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)

average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]

print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")

############################### CROSS VALIDATION ############################
import numpy as np
import cv2

for img in image_info_gt:
    dummy = np.ones((img["height"], img["width"], 3), dtype=np.uint8) * 255
    print(img)
    for gt in ground_truth_annotations:
        if gt["image_id"] == img["id"]:
            overlay = dummy.copy()
            pt1 = tuple(int(v) for v in gt["bbox"][:2])
            pt2 = tuple(int(x + w) for x, w in zip(gt["bbox"][:2], gt["bbox"][2:]))
            cv2.rectangle(overlay, pt1, pt2, (0, 255, 0), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    for pr in predicted_annotations:
        if pr["image_id"] == img["id"]:
            overlay = dummy.copy()
            pt1 = tuple(int(v) for v in pr["bbox"][:2])
            pt2 = tuple(int(x + w) for x, w in zip(pr["bbox"][:2], pr["bbox"][2:]))
            cv2.rectangle(overlay, pt1, pt2, (0, 0, 255), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    cv2.imshow("image", dummy)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
#############################################################################

This seems to be working for me. Feel free to provide any corrections.
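
One small robustness tweak (a sketch, assuming the same `coco_eval` object as above): the IoU threshold index can be looked up from `coco_eval.params.iouThrs` instead of being hardcoded.

```python
import numpy as np

# Look up the index of the desired IoU threshold instead of hardcoding it.
iou_threshold = 0.5
iou_treshold_index = int(np.argmin(np.abs(coco_eval.params.iouThrs - iou_threshold)))
```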

mxw20010804 commented 2 months ago

@hafiz031 @andreaceruti I'm sorry to bother you. I tried your code and got a result like this:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.553
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.398
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.324
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.594
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.625
==============================
PRECISION: 0.066057091882248
RECALL: 0.8935682394111258
F1_SCORE: 0.1230199274007991
AVERAGE_PRECISION: 0.3617328617872838
AVERAGE_RECALL: 0.5938189911568441

iou_treshold_index is set to 0, so the IoU threshold is 0.5. Why is my PRECISION = 0.066 and RECALL = 0.89, while coco_eval says the precision is 0.553 at IoU=0.5?

hafiz031 commented 2 months ago

@mxw20010804 you might be confusing precision with average precision. Precision (not average precision), recall, and F-score are not reported by the library; I added custom code to calculate them. Learn more about precision, recall, and mean average precision.
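
To make the distinction concrete, here is a rough sketch (same `coco_eval` object, default area ranges and maxDets assumed) of what `summarize()` reports as AP@0.50, versus the single-ratio PRECISION printed by the script:

```python
import numpy as np

# Roughly what summarize() reports as AP@0.50 (area="all", maxDets=100):
# the mean of the interpolated precision over all recall thresholds and categories.
prec_50 = coco_eval.eval["precision"][0, :, :, 0, -1]
valid = prec_50 > -1
ap_50 = prec_50[valid].mean() if valid.any() else float("nan")
print(f"AP@0.50 reconstructed by hand: {ap_50:.3f}")

# The PRECISION printed by the script above is a different quantity: a single ratio
# total_tp / (total_tp + total_fp) at IoU=0.50, with no score threshold applied,
# so every low-confidence detection counts as a false positive.
```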

soeb-hussain commented 1 month ago

Hi @hafiz031,

Thanks for sharing the code, but it doesn't seem to be working for me. I am running the script below as a sanity test, and I am getting the following output:

PRECISION: 0.0
RECALL: 0.0
F1_SCORE: 0.0
AVERAGE_PRECISION: 0.0
AVERAGE_RECALL: 0.0

I was expecting precision and recall to be 1; let me know your thoughts.

Cheers

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd

ground_truth_annotations = [
    {'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [100, 100, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
]

predicted_annotations = [
    {'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [101, 101, 100, 100], 'score': 1, 'area': 10000},
]

image_info_gt = [
                {'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
            ]

# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()

coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()

# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')

# Set parameters
coco_eval.params.areaRng = [[0 , 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)

# pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)

total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case

# Select the IoU threshold index (note: index 3 corresponds to IoU = 0.65, not 0.5)
iou_treshold_index = 3 # iouThrs - [.5:.05:.95], T=10 IoU thresholds for evaluation; index 0 is threshold=0.5

for image_evaluation_dict in coco_eval.evalImgs:
    print("-" * 30)

    # All the detections from the model, it is a numpy of True/False
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]

    # Here we consider the detection that we can not ignore (we use the not operator on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")

    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")

    # And finally we calculate tp, fp and the total positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()

    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)

    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")

    total_tp += tp
    total_fp += fp
    total_gt += n_gt

precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)

average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]

print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")

hafiz031 commented 1 month ago

@soeb-hussain this library has quite a few rough edges, and unfortunately I could not find a better alternative. Change the id values in both ground_truth_annotations and predicted_annotations to anything other than 0, and then it works. id 0 causes a problem when there is only one annotation; from what I have found, if you have more than one item there is apparently no problem with using 0. So it is better to avoid id 0 here to avoid confusion. I also advise you to do more unit testing like this before using the library, to check whether it gives the results you expect.
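
For what it's worth, this is likely because COCOeval stores the matched counterpart's id in its dtMatches/gtMatches arrays and uses 0 to mean "no match", so a ground-truth annotation with id 0 is indistinguishable from a miss. A minimal workaround sketch, applied before building the COCO objects:

```python
# Re-number annotation ids starting from 1; COCOeval stores the matched counterpart's
# id in its match arrays and uses 0 to mean "no match", so id 0 is best avoided.
for new_id, ann in enumerate(ground_truth_annotations, start=1):
    ann["id"] = new_id
for new_id, ann in enumerate(predicted_annotations, start=1):
    ann["id"] = new_id
```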

ksv87 commented 1 month ago

Hi, thanks for the code! Maybe I am misunderstanding something, but the result does not really make sense to me:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.517
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.873
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.538
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.225
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.322
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.195
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.526
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.392
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.663

but

PRECISION: 0.4527624309392265
RECALL: 0.9015401540154016
F1_SCORE: 0.6027951452739978
AVERAGE_PRECISION: 0.5172575619300731
AVERAGE_RECALL: 0.5613861386138613

I have one class. How can the PRECISION (at IoU threshold 0.5) be lower than the average precision at IoU threshold 0.5?

hafiz031 commented 1 month ago

Hi @ksv87, can you share a small input example which can reproduce this?