facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Recall evaluation results lower than expected (instance segmentation) #1011

Closed jdegeus closed 3 years ago

jdegeus commented 3 years ago

Hello, I've come pretty far with all the good documentation and info from this repository, thank you for that! 👌 I have a question regarding evaluation, specifically the recall of the trained model. The recall calculated by the default inference_on_dataset() evaluation is lower than I expect it to be.

My trained model has 1 class. My training dataset has 14 images and my validation dataset has 3 images, and almost every image has 18 objects annotated in COCO format.
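For context, the datasets are registered from their COCO JSON files roughly like this (a sketch: the dataset names, the train JSON path, and the image directories below are placeholders; only the val JSON path appears in the logs further down):

from detectron2.data.datasets import register_coco_instances

# Placeholder names/paths for illustration; register each split from its COCO JSON.
register_coco_instances(
    "lettuce_2020_train", {},
    "/content/drive/MyDrive/datasets/lettuce/annotations/lettuce_2020_train.json",
    "/content/drive/MyDrive/datasets/lettuce/images/train")
register_coco_instances(
    "lettuce_2020_val", {},
    "/content/drive/MyDrive/datasets/lettuce/annotations/lettuce_2020_val.json",
    "/content/drive/MyDrive/datasets/lettuce/images/val")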

Expected results

I looked at the predicted results on the validation dataset with the following code:

import cv2
from detectron2.data import DatasetCatalog
from detectron2.engine import DefaultPredictor

predictor = DefaultPredictor(cfg)  # assuming predictor is a DefaultPredictor built from the same cfg

# Run the predictor on every image of the validation dataset and print the scores.
dataset_dicts = DatasetCatalog.get(list(cfg.DATASETS.TEST)[0])
for d in dataset_dicts:
    file_name = d["file_name"]
    img = cv2.imread(file_name)
    predictions = predictor(img)["instances"].to("cpu")
    pred_scores = predictions.scores if predictions.has("scores") else None
    print(pred_scores)

With the following output (18 object predictions for each of the 3 validation images):

tensor([1.0000, 1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998, 0.9998])
tensor([1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9997, 0.9996])
tensor([1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9997])

As I understand it, recall is calculated with the formula recall = true positives / number of ground truths, which should return 1.0 when all IoUs are greater than the threshold.

I have calculated all IoUs myself and they are as follows:

[0.95, 0.96, 0.94, 0.94, 0.96, 0.95, 0.94, 0.93, 0.95, 0.94, 0.94, 0.93, 0.85, 0.92, 0.94, 0.92, 0.89, 0.93]
[0.93, 0.93, 0.94, 0.92, 0.93, 0.93, 0.94, 0.93, 0.94, 0.94, 0.96, 0.91, 0.93, 0.95, 0.94, 0.91, 0.88, 0.94]
[0.93, 0.92, 0.91, 0.93, 0.92, 0.93, 0.92, 0.89, 0.93, 0.92, 0.93, 0.96, 0.92, 0.92, 0.93, 0.9, 0.9, 0.95]

When I calculated the AR myself (even for IoU=0.50:0.95), my outcome was 1.00.
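For reference, this is roughly how I compute recall and AR from those IoUs (my own sketch, not the pycocotools implementation; it assumes every prediction is matched to exactly one ground truth):

# Sketch of the recall / AR calculation described above.
def recall_at_threshold(ious, num_ground_truths, threshold):
    # recall = true positives / number of ground truths
    true_positives = sum(1 for iou in ious if iou >= threshold)
    return true_positives / num_ground_truths

def average_recall(ious, num_ground_truths):
    # Average the recall over the COCO IoU thresholds 0.50, 0.55, ..., 0.95
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    recalls = [recall_at_threshold(ious, num_ground_truths, t) for t in thresholds]
    return sum(recalls) / len(recalls)

ious_image_1 = [0.95, 0.96, 0.94, 0.94, 0.96, 0.95, 0.94, 0.93, 0.95,
                0.94, 0.94, 0.93, 0.85, 0.92, 0.94, 0.92, 0.89, 0.93]
print(recall_at_threshold(ious_image_1, num_ground_truths=18, threshold=0.50))  # 1.0
print(average_recall(ious_image_1, num_ground_truths=18))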

Actual results

[12/13 21:38:57 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[12/13 21:38:57 d2.data.datasets.coco]: Loaded 3 images in COCO format from /content/drive/MyDrive/datasets/lettuce/annotations/lettuce_2020_val.json
[12/13 21:38:57 d2.data.common]: Serializing 3 elements to byte tensors and concatenating them all ...
[12/13 21:38:57 d2.data.common]: Serialized dataset takes 0.08 MiB
[12/13 21:38:57 d2.evaluation.evaluator]: Start inference on 3 images
[12/13 21:39:00 d2.evaluation.evaluator]: Total inference time: 0:00:00.667350 (0.667350 s / img per device, on 1 devices)
[12/13 21:39:00 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:00 (0.184663 s / img per device, on 1 devices)
[12/13 21:39:00 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[12/13 21:39:00 d2.evaluation.coco_evaluation]: Saving results to ./output/inference/coco_instances_results.json
[12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
COCOeval_opt.evaluate() finished in 0.00 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.00 seconds.

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.869
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.869
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.048
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.498
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.898
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.898

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
|   AP   |  AP50   |  AP75   |  APs  |  APm  |  APl   |
|:------:|:-------:|:-------:|:-----:|:-----:|:------:|
| 86.917 | 100.000 | 100.000 |  nan  |  nan  | 86.917 |

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type segm
COCOeval_opt.evaluate() finished in 0.01 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.00 seconds.

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.887
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.887
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.050
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.502
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.902
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.902

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluation results for segm:
|   AP   |  AP50   |  AP75   |  APs  |  APm  |  APl   |
|:------:|:-------:|:-------:|:-----:|:-----:|:------:|
| 88.672 | 100.000 | 100.000 |  nan  |  nan  | 88.672 |

Detailed steps to reproduce


import os
from detectron2.config import get_cfg

# Trainer is assumed to be a DefaultTrainer subclass whose build_evaluator returns a COCOEvaluator.
cfg = get_cfg()
cfg.merge_from_file("/content/drive/MyDrive/datasets/lettuce/configs/mask_rcnn_R_101_FPN_3x.yaml")
cfg.MODEL.WEIGHTS = "/content/drive/MyDrive/datasets/lettuce/output/model_0074999.pth"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
trainer.resume_or_load(resume=True)

# Evaluates trainer.model on cfg.DATASETS.TEST and prints the AP/AR numbers above.
res = trainer.test(cfg, trainer.model)
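
For completeness, the same evaluation can also be invoked explicitly (a sketch; it assumes the validation set is the first entry in cfg.DATASETS.TEST, and the COCOEvaluator constructor arguments vary slightly between detectron2 versions):

from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

val_dataset = cfg.DATASETS.TEST[0]
evaluator = COCOEvaluator(val_dataset, output_dir="./output/inference")
val_loader = build_detection_test_loader(cfg, val_dataset)
# Produces the same bbox/segm AP and AR tables as trainer.test() above.
print(inference_on_dataset(trainer.model, val_loader, evaluator))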

System information

Google Colab Notebook