Recall evaluation results lower than expected (instance segmentation)

Hello, I've come pretty far with all the good documentation and info from this repository, thank you for that! 👌 I have a question regarding the evaluation, specifically the recall of the trained model. The recall that is calculated by the default 'inference_on_dataset()' is lower than I expect it to be.

My trained model has 1 class where I have 14 images for my training dataset and 3 images for my validation dataset, where almost every image has 18 annotated objects in COCO format.

Expected results

I looked at the predicted results on the validation dataset with the following code:

dataset_dicts = DatasetCatalog.get(list(cfg.DATASETS.TEST)[0])
for d in dataset_dicts:
    file_name = d["file_name"]
    img = cv2.imread(file_name)
    predictions = predictor(img)["instances"].to("cpu")
    pred_scores = predictions.scores if predictions.has("scores") else None
    print(pred_scores)

With the following output (18 times 3 object predictions):

tensor([1.0000, 1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998, 0.9998])
tensor([1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9997, 0.9996])
tensor([1.0000, 1.0000, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999,
        0.9999, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9997, 0.9997])

So as I understand recall is calculated with the formula: recall = True positives / number of ground truths Which should return 1.0 when all IoU's are greater than threshhold.

I have calculated all IoU's myself and they are as follows:

[0.95, 0.96, 0.94, 0.94, 0.96, 0.95, 0.94, 0.93, 0.95, 0.94, 0.94, 0.93, 0.85, 0.92, 0.94, 0.92, 0.89, 0.93]
[0.93, 0.93, 0.94, 0.92, 0.93, 0.93, 0.94, 0.93, 0.94, 0.94, 0.96, 0.91, 0.93, 0.95, 0.94, 0.91, 0.88, 0.94]
[0.93, 0.92, 0.91, 0.93, 0.92, 0.93, 0.92, 0.89, 0.93, 0.92, 0.93, 0.96, 0.92, 0.92, 0.93, 0.9, 0.9, 0.95]

When I calculated the AR (even when IoU=0.50:0.95), my outcome was 1.00.

So what am I not taking in account?
Is there detailed documentation about the calculations? (had a hard time understanding the source code)

Actual results

[12/13 21:38:57 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')] [12/13 21:38:57 d2.data.datasets.coco]: Loaded 3 images in COCO format from /content/drive/MyDrive/datasets/lettuce/annotations/lettuce_2020_val.json [12/13 21:38:57 d2.data.common]: Serializing 3 elements to byte tensors and concatenating them all ... [12/13 21:38:57 d2.data.common]: Serialized dataset takes 0.08 MiB [12/13 21:38:57 d2.evaluation.evaluator]: Start inference on 3 images [12/13 21:39:00 d2.evaluation.evaluator]: Total inference time: 0:00:00.667350 (0.667350 s / img per device, on 1 devices) [12/13 21:39:00 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:00 (0.184663 s / img per device, on 1 devices) [12/13 21:39:00 d2.evaluation.coco_evaluation]: Preparing results for COCO format ... [12/13 21:39:00 d2.evaluation.coco_evaluation]: Saving results to ./output/inference/coco_instances_results.json [12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API... Loading and preparing results... DONE (t=0.00s) creating index... index created! Running per image evaluation... Evaluate annotation type bbox COCOeval_opt.evaluate() finished in 0.00 seconds. Accumulating evaluation results... COCOeval_opt.accumulate() finished in 0.00 seconds.

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluation results for bbox:	AP	AP50	AP75	APs	APm	APl
86.917	100.000	100.000	nan	nan	86.917

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN. Loading and preparing results... DONE (t=0.00s) creating index... index created! Running per image evaluation... Evaluate annotation type segm COCOeval_opt.evaluate() finished in 0.01 seconds. Accumulating evaluation results... COCOeval_opt.accumulate() finished in 0.00 seconds.

[12/13 21:39:00 d2.evaluation.coco_evaluation]: Evaluation results for segm:	AP	AP50	AP75	APs	APm	APl
88.672	100.000	100.000	nan	nan	88.672

Detailed steps to reproduce


cfg = get_cfg()
cfg.merge_from_file("/content/drive/MyDrive/datasets/lettuce/configs/mask_rcnn_R_101_FPN_3x.yaml")
cfg.MODEL.WEIGHTS = "/content/drive/MyDrive/datasets/lettuce/output/model_0074999.pth"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
trainer.resume_or_load(resume=True)

res = trainer.test(cfg, trainer.model)

System information

Google Colab Notebook

facebookresearch / Detectron