Hi, I have written a custom script for calculating these metrics.
You can head to the link below to fetch the code files needed to calculate them:
https://gist.github.com/shivamsnaik/c5c5e99c00819d2167317b1e56871187
@shivamsnaik I think what you are calculating is not the standard definition of precision, recall, and F1-score. For precision defined as TP / (TP + FP) and recall as TP / (TP + FN), I think we have to calculate the TP, FP, and FN for each image and then average over the dataset. The problem is that the API is not very user-friendly, so it is a mess to find these values.
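Just to illustrate what I mean by averaging over the dataset (a rough sketch, not code I have actually run; per_image_counts is a hypothetical list of per-image TP/FP/FN values):

# Hypothetical sketch of per-image (macro) averaging of precision and recall.
# per_image_counts: list of (tp, fp, fn) tuples, one entry per image in the dataset.
def macro_precision_recall_f1(per_image_counts, eps=1e-11):
    precisions, recalls = [], []
    for tp, fp, fn in per_image_counts:
        precisions.append(tp / max(tp + fp, eps))  # precision = TP / (TP + FP)
        recalls.append(tp / max(tp + fn, eps))     # recall    = TP / (TP + FN)
    precision = sum(precisions) / len(precisions)  # average over the dataset
    recall = sum(recalls) / len(recalls)
    f1 = 2 * precision * recall / max(precision + recall, eps)
    return precision, recall, f1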
The TP, FP, etc. are being calculated per image, per iou threshold in the COCOeval.accumulate() function.
But COCO has not provided any code to finally combine all the information and reduce the metrics to a single value.
I just take the precision, recall, and scores received from COCOeval.accumulate() and compute the final metric.
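For reference, after accumulate() the arrays live in coco_eval.eval and are indexed as [iouThr, recall, category, areaRng, maxDets]. A minimal sketch of how a single IoU slice can be reduced (assuming evaluate() and accumulate() have already been run on a COCOeval object named coco_eval; this is a simplified illustration, not the exact gist code):

# Layout of the accumulated arrays in pycocotools:
#   coco_eval.eval['precision'].shape == (T, R, K, A, M)
#   coco_eval.eval['recall'].shape    == (T, K, A, M)
# T = 10 IoU thresholds (0.50:0.05:0.95), R = 101 recall points,
# K = number of categories, A = number of area ranges, M = len(maxDets).
import numpy as np

precision = coco_eval.eval['precision']     # assumes accumulate() has been called
p50 = precision[0, :, :, 0, -1]             # IoU=0.50, area range 'all', maxDets=100
p50 = p50[p50 > -1]                         # -1 marks (recall, category) cells with no data
ap50 = np.mean(p50) if p50.size else float('nan')
print("AP@0.50 =", ap50)                    # matches the AP@0.50 line of summarize()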
@shivamsnaik Yes, I have seen your code, but I do not think that, for example, taking the precision, averaging it over the recall values (precesion_iou = precision[iou_lookup[iou], :, :, 0, -1].mean(1)), and then taking the mean of those values is what I am looking for.
This is my problem: I have on the left the ground-truth annotations of an image and on the right what my model infers. To calculate the precision and recall for this image, this is the code I use:
imgIds = [147]
coco_eval = COCOeval(coco_gt, coco_dt, "segm")
coco_eval.params.imgIds = imgIds
coco_eval.evaluate()
image_evaluation_dict = coco_eval.evalImgs[0]
iou_treshold_index = 0
detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]
mask = ~detection_ignore
n_ignored = detection_ignore.sum()
tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()
recall = tp / n_gt
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
Do you think this is good or do you think that something is wrong with my reasoning?
@andreaceruti Your approach gives the specific true positive, false positive, etc. values too. This is really useful. Thanks for sharing it with me. It has helped me understand the COCOeval.evaluate() results and the purpose of their properties better.
@shivamsnaik thanks for your words, they are really appreciated. I have just adapted what I saw in a GitHub repository, so I can't take credit for this computation. But since calculating these values is a common problem, I hope other people will also jump into the discussion and try to figure out whether something is wrong.
I have modified @andreaceruti's code a bit to work on the complete test set. I also have some observations, including about the library implementation (correct me if any of them are wrong):
The ignore flag implementation may have a bug: the value of ignore is getting overridden by the value of iscrowd. See here:
https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L109
So, if we need to set ignore, we have no option but to set both ignore and iscrowd to the same value, or just set iscrowd only, since ignore eventually gets the same value after the override.
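To make the override concrete, here is a minimal sanity check (an illustrative example I am adding for this comment, not part of my evaluation code below): a ground-truth box marked ignore=1 but iscrowd=0 still ends up not ignored after evaluate():

# Minimal check of the ignore/iscrowd override (illustrative example only).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

images = [{'id': 1, 'file_name': 'x.jpg', 'height': 100, 'width': 100}]
categories = [{'id': 1, 'name': 'object'}]

coco_gt = COCO()
coco_gt.dataset = {
    'images': images,
    'annotations': [{'id': 1, 'image_id': 1, 'category_id': 1, 'bbox': [10, 10, 20, 20],
                     'area': 400, 'ignore': 1, 'iscrowd': 0}],  # ignore=1, but iscrowd=0
    'categories': categories
}
coco_gt.createIndex()

coco_dt = COCO()
coco_dt.dataset = {
    'images': images,
    'annotations': [{'id': 1, 'image_id': 1, 'category_id': 1, 'bbox': [10, 10, 20, 20],
                     'area': 400, 'score': 1.0}],
    'categories': categories
}
coco_dt.createIndex()

coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
# If 'ignore' were respected, gtIgnore would be [1]; instead it comes out [0],
# because the library overwrites 'ignore' with the 'iscrowd' value.
print(coco_eval.evalImgs[0]['gtIgnore'])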
I have also found some open issues on this. Anyway, the following is my updated code:
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd
ground_truth_annotations = [
{'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [200, 200, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
{'id': 2, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
{'id': 3, 'image_id': 2, 'category_id': 1, 'bbox': [150, 130, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000},
{'id': 5, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
{'id': 4, 'image_id': 4, 'category_id': 1, 'bbox': [80, 90, 100, 100], 'ignore': 1, 'iscrowd': 1, 'area': 10000}
]
predicted_annotations = [
{'id': 5, 'image_id': 0, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 1, 'area': 10000},
{'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 1, 'area': 10000},
{'id': 4, 'image_id': 2, 'category_id': 1, 'bbox': [50, 30, 100, 100], 'score': 0.1, 'area': 10000},
{'id': 3, 'image_id': 4, 'category_id': 1, 'bbox': [250, 250, 100, 100], 'score': 0.9, 'area': 10000}
]
image_info_gt = [
{'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
{'id': 2, 'file_name': '2.jpg', 'height': 435, 'width': 450},
{'id': 4, 'file_name': '3.jpg', 'height': 375, 'width': 500}
]
# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()
coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()
# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
# Set parameters
coco_eval.params.areaRng = [[0 ** 2, 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)
pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)
total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case
# Select the index related to IoU = 0.5
iou_treshold_index = 0 # iouThrs - [.5:.05:.95] T=10 IoU thresholds for evaluation, so, index 0 is threshold=0.5
for image_evaluation_dict in coco_eval.evalImgs:
    print("-" * 30)
    # Ignore flags for all the detections from the model (a numpy array of True/False)
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]
    # Here we keep only the detections that we cannot ignore (logical NOT on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")
    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")
    # And finally we calculate tp, fp and the total number of positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()
    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)
    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")
    total_tp += tp
    total_fp += fp
    total_gt += n_gt
precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)
average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]
print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")
############################### CROSS VALIDATION ############################
import numpy as np
import cv2
for img in image_info_gt:
    dummy = np.ones((img["height"], img["width"], 3), dtype=np.uint8) * 255
    print(img)
    for gt in ground_truth_annotations:
        if gt["image_id"] == img["id"]:
            overlay = dummy.copy()
            cv2.rectangle(overlay, gt["bbox"][:2], np.array(gt["bbox"][2:]) + np.array(gt["bbox"][:2]),
                          (0, 255, 0), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    for pr in predicted_annotations:
        if pr["image_id"] == img["id"]:
            overlay = dummy.copy()
            cv2.rectangle(overlay, pr["bbox"][:2], np.array(pr["bbox"][2:]) + np.array(pr["bbox"][:2]),
                          (0, 0, 255), cv2.FILLED)
            alpha = 0.3  # Transparency factor.
            dummy = cv2.addWeighted(overlay, alpha, dummy, 1 - alpha, 0)
    cv2.imshow("image", dummy)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
#############################################################################
This seems to be working for me. Feel free to provide any corrections.
@hafiz031 @andreaceruti I'm sorry to bother you. I tried your code and I got results like this:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.362
**Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.553**
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.398
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.324
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.389
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.250
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.549
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.594
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.021
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.625
==============================
**PRECISION: 0.066057091882248**
RECALL: 0.8935682394111258
F1_SCORE: 0.1230199274007991
AVERAGE_PRECISION: 0.3617328617872838
AVERAGE_RECALL: 0.5938189911568441
iou_treshold_index is set to 0, so the IoU threshold is 0.5. Why are my PRECISION = 0.066 and RECALL = 0.89, while coco_eval says the precision is 0.553 at IoU=0.5?
@mxw20010804 you might be confusing precision with average precision. Precision (not average precision), recall, and F-score are not reported by the library; I added custom code to calculate those. You can read more about precision, recall, and mean average precision.
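To spell out the difference (my own sketch, assuming the coco_eval object and the total_tp / total_fp counters from the script above): AP@0.50 reported by summarize() is the area under the score-ranked precision-recall curve at IoU 0.5, while the PRECISION printed by the script is a single operating point, total TP / (TP + FP), computed over every detection with no score cut-off:

import numpy as np

# (1) What summarize() reports as "AP @ IoU=0.50": interpolated precision averaged
#     over 101 recall points, i.e. the area under the PR curve built by ranking
#     detections by confidence score.
p = coco_eval.eval['precision'][0, :, :, 0, -1]   # IoU=0.50, area='all', maxDets=100
ap50 = np.mean(p[p > -1])

# (2) What the script prints as PRECISION: one point, using ALL detections,
#     including every low-confidence box the model produced.
precision_all_dets = total_tp / max(total_tp + total_fp, 1e-11)

# A model that outputs many low-score false positives can have a tiny (2) while (1)
# stays high, because low-score false positives only lower the high-recall tail of
# the ranked PR curve. Filtering detections by a score threshold before evaluation
# would bring (2) much closer to intuition.
print(ap50, precision_all_dets)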
Hi @hafiz031,
Thanks for sharing the code; it doesn't seem to be working for me. I am running the script below as a sanity test, and I am getting the following output:
PRECISION: 0.0
RECALL: 0.0
F1_SCORE: 0.0
AVERAGE_PRECISION: 0.0
AVERAGE_RECALL: 0.0
I was expecting precision and recall to be 1; let me know your thoughts.
Cheers
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import numpy as np
import pandas as pd
ground_truth_annotations = [
{'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [100, 100, 100, 100], 'ignore': 0, 'iscrowd': 0, 'area': 10000},
]
predicted_annotations = [
{'id': 0, 'image_id': 0, 'category_id': 1, 'bbox': [101, 101, 100, 100], 'score': 1, 'area': 10000},
]
image_info_gt = [
{'id': 0, 'file_name': '1.jpg', 'height': 500, 'width': 375},
]
# Create separate COCO instances for ground truth and predicted annotations
coco_gt = COCO()
coco_gt.dataset = {'images': image_info_gt, 'annotations': ground_truth_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_gt.createIndex()
coco_dt = COCO()
coco_dt.dataset = {'images': image_info_gt, 'annotations': predicted_annotations, 'categories': [{'id': 1, 'name': 'object'}]}
coco_dt.createIndex()
# Create COCOeval object
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
# Set parameters
coco_eval.params.areaRng = [[0 , 1e5 ** 2]]
coco_eval.params.areaRngLbl = ["all"]
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
print("=" * 30)
# pd.DataFrame(coco_eval.evalImgs).to_excel("coco_eval.xlsx", index = False)
total_tp = 0
total_fp = 0
total_gt = 0
epsilon = 0.00000000001 # To avoid divide by 0 case
# Select the index related to IoU = 0.5
iou_treshold_index = 3 # iouThrs - [.5:.05:.95] T=10 IoU thresholds for evaluation, so, index 0 is threshold=0.5
for image_evaluation_dict in coco_eval.evalImgs:
    print("-" * 30)
    # Ignore flags for all the detections from the model (a numpy array of True/False)
    detection_ignore = image_evaluation_dict["dtIgnore"][iou_treshold_index]
    # Here we keep only the detections that we cannot ignore (logical NOT on every element of the array)
    mask = ~detection_ignore
    print(f"Mask on Detection [NotIgnored]: {mask}")
    n_ignored = detection_ignore.sum()
    print(f"Ignore count from detected bboxes from [image: {image_evaluation_dict['image_id']}]: {n_ignored}")
    # And finally we calculate tp, fp and the total number of positives (n_gt)
    tp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] > 0).sum()
    fp = (image_evaluation_dict["dtMatches"][iou_treshold_index][mask] == 0).sum()
    n_gt = len(image_evaluation_dict["gtIds"]) - image_evaluation_dict["gtIgnore"].astype(int).sum()
    per_example_precision = tp / max((tp + fp), epsilon)
    per_example_recall = tp / max(n_gt, epsilon)
    per_example_f1 = 2 * per_example_precision * per_example_recall / max((per_example_precision + per_example_recall), epsilon)
    print(f"Precision [image: {image_evaluation_dict['image_id']}]: {per_example_precision}")
    print(f"Recall [image: {image_evaluation_dict['image_id']}]: {per_example_recall}")
    print(f"F1 score [image: {image_evaluation_dict['image_id']}]: {per_example_f1}")
    total_tp += tp
    total_fp += fp
    total_gt += n_gt
precision = total_tp / max((total_tp + total_fp), epsilon)
recall = total_tp / max(total_gt, epsilon)
f1 = 2 * precision * recall / max((precision + recall), epsilon)
average_precision = coco_eval.stats[0]
average_recall = coco_eval.stats[8]
print("=" * 30)
print(f"PRECISION: {precision}")
print(f"RECALL: {recall}")
print(f"F1_SCORE: {f1}")
print(f"AVERAGE_PRECISION: {average_precision}")
print(f"AVERAGE_RECALL: {average_recall}")
@soeb-hussain this library has so many problems, and unfortunately I could not find any better alternatives. Change the id value of both ground_truth_annotations and predicted_annotations to anything other than 0, and then it works. An id of 0 creates a problem when there is only one annotation. From what I have found, if you have more than one item there is apparently no problem in using 0, but it is better to avoid id 0 here to avoid confusion. Also, I advise you to do more unit tests like this before using this library, to check whether it gives the results you expect.
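For example, a minimal sketch of the workaround (only the ids change; I have not tested this beyond the single-annotation case):

# Same sanity test as above, with the only change being non-zero annotation ids.
ground_truth_annotations = [
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [100, 100, 100, 100],
     'ignore': 0, 'iscrowd': 0, 'area': 10000},
]
predicted_annotations = [
    {'id': 1, 'image_id': 0, 'category_id': 1, 'bbox': [101, 101, 100, 100],
     'score': 1, 'area': 10000},
]
# Rerunning the rest of the script unchanged should now report PRECISION = RECALL = 1,
# most likely because dtMatches stores the id of the matched ground-truth box, so a
# matched gt with id 0 is indistinguishable from "no match" in the dtMatches > 0 check.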
Hi, thanks for the code! Maybe I am missing something, but these numbers do not really make sense to me:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.517
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.873
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.538
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.225
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.322
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.195
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.526
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.392
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.663
but
PRECISION: 0.4527624309392265
RECALL: 0.9015401540154016
F1_SCORE: 0.6027951452739978
AVERAGE_PRECISION: 0.5172575619300731
AVERAGE_RECALL: 0.5613861386138613
I have only one class; how can the PRECISION (at IoU threshold 0.5) be lower than the Average Precision averaged over IoU thresholds 0.50:0.95?
Hi @ksv87, can you share a small input example which can reproduce this?
Has anyone managed to find the right way to calculate the F1 score for different tasks?