Open · MatthewInkawhich opened this issue 5 years ago
After looking around the internet (including this paper), I cannot seem to find a satisfactory explanation of the Average Recall (AR) metric. On the COCO website, AR is described as: "the maximum recall given a fixed number of detections per image, averaged over categories and IoUs".
What does "maximum recall" mean here?
I was wondering if someone could give a reference or a high level overview of the AR calculation algorithm.
Thanks! Matt
Have you solved it yet? I have the same question.
I would also like to know what it means and how it is calculated.
I think I've got it. That "maximum recall" is calculated as the number of objects that can be detected under a fixed budget of detections per image (say, 10 or 100 detections per image), divided by the number of ground-truth objects in the image.
My expression may be a little bit confusing. XD
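If I'm reading that guess right, then for a given IoU threshold and detection budget (t and N here are just my notation, not anything from the COCO docs), per class it would be something like:

```latex
\mathrm{recall}(t, N) =
  \frac{\#\{\text{ground-truth objects matched by one of the top-}N\text{ detections at IoU} \ge t\}}
       {\#\{\text{ground-truth objects}\}}
```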
You mean we only calculate recall, rather than some area under a recall-IoU curve?
Yup, at least I don't see any mention of an integral. I think this scalar is meant to test whether the model can infer enough objects within a limited number of detections. Just my guess.
Wait, the area under the recall-IoU curve should be involved. The definition says AR is averaged over IoUs and over all categories.
I got you guys. Take a set of IoU values, say iou_vals = [0.5, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95] (which is 10 values). Then, for each category, compute the recall of your model at detecting that category at a specific IoU threshold. Then consider two cases:
1) a mean average recall at a specific IoU value (over all categories)
2) a mean average recall over all IoU values (over all categories)
Both cases can be considered over a varying number of maximum detections: 1, 10, 100, and even 1000 for RPNs (more on this later).
For case 1), pick a specific iou_val, compute the recall for each category at that threshold, and then take the average of those values over all classes.
For case 2), for each class, average the recall over all of the IoU values, and then average those per-class averages over all classes; this is your mAR^{D} for D detections.
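Here is a minimal sketch of case 2), assuming you have already computed a per-class, per-IoU recall table elsewhere (the recalls dict and the numbers in it are hypothetical, with detections capped at D per image):

```python
# Hypothetical sketch of mAR^{D} (case 2): average recall over IoU thresholds,
# then over classes. Assumes `recalls[class_id][iou]` was computed elsewhere
# with at most D detections kept per image.
iou_vals = [0.5, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

def mean_average_recall(recalls):
    per_class_ar = []
    for class_id, recall_by_iou in recalls.items():
        # Case 1) would stop here, using a single IoU value instead of the average.
        ar = sum(recall_by_iou[iou] for iou in iou_vals) / len(iou_vals)
        per_class_ar.append(ar)
    # Case 2): average the per-class ARs over all classes.
    return sum(per_class_ar) / len(per_class_ar)

# Example with two made-up classes:
recalls = {
    1: {iou: 0.9 - 0.5 * (iou - 0.5) for iou in iou_vals},
    2: {iou: 0.7 - 0.6 * (iou - 0.5) for iou in iou_vals},
}
print(mean_average_recall(recalls))
```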
However, relating to the paper "What makes for effective detection proposals?", keep in mind that they are computing the average recall of a region proposal method, not of an object detection model. Relatedly, you can see on page six of the Feature Pyramid Networks paper that their use of Average Recall is for the RPN variants, not for the overall model. Using PyTorch's ResNet-50 FPN Faster R-CNN model pre-trained on the MS-COCO 2017 train set, I tested the mean average recall variant 2) on the last 5K images of the val set with 100 detections, and I get a mean average recall over all IoU values at 100 detections of 0.33. You can find this model on the torchvision GitHub.
For a smaller number of detections, say 10, I get something much worse (like 0.26, if memory serves correctly). I don't recall what happened when I fixed the IoU value, but you can imagine that if it was 0.5 (the minimum IoU value), the result would be much better overall, say in the range of 0.50-0.60. This should give you an idea of a baseline level of performance.
Now, for anyone who wants to compute the integral of the recall-IoU curve more precisely, you could use scikit-learn's AUC (Area Under Curve) functionality (see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html), and simply compute your recall/IoU pairs and plug them into this function. I'd be interested to see if anyone takes this approach. You might have to add endpoints, by the way, as is the case when using recall-precision pairs to compute the Pascal VOC 2012 metric for average precision at a specific IoU.
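A rough sketch of that idea for a single class, assuming you already have the recall value at each IoU threshold (the recall numbers below are made up); sklearn.metrics.auc just applies the trapezoidal rule to the (x, y) pairs:

```python
import numpy as np
from sklearn.metrics import auc

# Made-up recall values for one class at each COCO IoU threshold.
iou_vals = np.array([0.5, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95])
recall_at_iou = np.array([0.82, 0.80, 0.77, 0.73, 0.68, 0.61, 0.52, 0.40, 0.25, 0.08])

# Add an endpoint at IoU = 1.0 (assuming recall drops to 0 there) so the
# curve covers the full [0.5, 1.0] range.
x = np.concatenate((iou_vals, [1.0]))
y = np.concatenate((recall_at_iou, [0.0]))

area = auc(x, y)           # trapezoidal area under the recall-IoU curve
ar = area / (1.0 - 0.5)    # normalize by the IoU range to get an average recall
print(ar)
```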
I had this question too. This reference, pp. 11-12, has the best explanation that I've found:
Padilla, R., Passos, W. L., Dias, T. L. B., Netto, S. L., & da Silva, E. A. B. (2021). A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics, 10(3), 279. https://doi.org/10.3390/electronics10030279
"one detection per image" means what? one bbox per image, not one gt per image ?
Your explanation above helps me, but one question still confuses me. At the RPN stage, there are often more than 100 proposals for a specific class. So, is AR with max=100 sufficient to evaluate the performance of the RPN?
@XueZ-phd That's a fair point. If you look at the torchvision implementation here: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py
They define four variables (with default values):
- rpn_pre_nms_top_n_train (int) = 2000: number of proposals to keep before applying NMS during training
- rpn_pre_nms_top_n_test (int) = 1000: number of proposals to keep before applying NMS during testing
- rpn_post_nms_top_n_train (int) = 2000: number of proposals to keep after applying NMS during training
- rpn_post_nms_top_n_test (int) = 1000: number of proposals to keep after applying NMS during testing
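If I'm reading that file right, these can be overridden as keyword arguments when building the model. A rough sketch, assuming a recent torchvision where weights="DEFAULT" loads the COCO-pretrained weights:

```python
# Sketch: overriding the RPN proposal budgets at construction time.
# Argument names are taken from the faster_rcnn.py file linked above.
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT",              # COCO-pretrained weights (recent torchvision)
    rpn_pre_nms_top_n_test=1000,    # proposals kept before NMS at test time
    rpn_post_nms_top_n_test=100,    # proposals kept after NMS at test time
)
model.eval()
```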
I think in older papers/implementations they probably used fewer region proposals at train/test time, so the number 100 made sense. I think I have seen AR max 1000 though, if I recall correctly (it's been a while since I've reviewed these older object detection papers). I notice in the FPN paper, like you say, they choose max=100. I think this is fair in a way, since after NMS there really shouldn't be more than 100 sufficiently different region proposals, given that Pascal VOC and MS COCO have on average fewer than 20 objects per image. I suspect the difference between AR_100 and AR_1000 would be trivial; I could be wrong though, nothing wrong with experimenting.
@XueZ-phd When you think about it, you could simply increase the recall of ANY region proposal network by adding random bounding boxes of various sizes/aspect ratios/centre locations, but then the network would be regressing/classifying two types of region proposals: 1) those that are pretty close to the correct ground-truth bounding box, and 2) those that are totally off from any ground-truth bounding box.
And I suspect this would result in an mAP drop as well as a longer training schedule; in effect it would basically defeat the entire purpose of the region proposal network, i.e. that classification/regression layers fed with good regions perform better than those fed with bad regions. So in a way, mean average recall for too large a number of boxes could be a bad indicator of how the model will perform once it's trained on those region proposals, given that the boxes have sufficient variance in their shapes and locations.
@JamesMcCullochDickens Thanks for the prompt response. I see your point -- the first 100 proposals have a high confidence level, so they are sufficient. Now, I have a question related to RPN. We generated a number of proposals in the RPN phase, and these proposals are listed in descending order of confidence. Based on this knowledge, I have the following questions.
They actually apply NMS after sampling from each level. I don't recall if the original FPN paper makes this explicit.
As for false positives, that's a bit of a curious issue for a pedestrian detector, given that humans are very distinct, identifiable objects. Typically, the issues I see causing bad performance with these models are bounding boxes not fitting tightly around the hands or feet, or missing a person who occludes/is occluded by another person. I'd be curious to see what dataset you are training/pretraining on. What is the nature of your false positives? Are they complete misses, duplicates, or boxes that fall just shy of 0.5 IoU with a GT box?
Perhaps you can message me privately since this is getting a bit off topic from the original post.
"one detection per image" means what? one bbox per image, not one gt per image ?
Yes, this is not with GT
I believe the max parameter (maxDets inside the code) is misinterpreted. Please see my comment on this issue for a detailed explanation.
TL;DR: Contrary to the belief that maxDets selects the top maxDets detections with the highest score overall in an image, it actually operates on a per-class basis.
Consider an instance where a given image has 3 predictions for Class-1 and 11 predictions for Class-2. If maxDets=10, it does not solely consider the top-10 boxes with the highest scores across all classes. Instead, it evaluates the top-10 detections for each individual class. In this scenario, the algorithm evaluates the 3 boxes for Class-1 and the top-10 boxes for Class-2, discarding the remaining predictions.
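A tiny sketch of that per-class behaviour (not the actual pycocotools code, just an illustration; detections here is a hypothetical list of (class_id, score) predictions for one image):

```python
from collections import defaultdict

def keep_per_class(detections, max_dets=10):
    """Keep at most max_dets highest-scoring detections per class,
    mirroring how maxDets is applied per category, not per image."""
    by_class = defaultdict(list)
    for class_id, score in detections:
        by_class[class_id].append(score)
    kept = {}
    for class_id, scores in by_class.items():
        kept[class_id] = sorted(scores, reverse=True)[:max_dets]
    return kept

# 3 predictions for Class-1 and 11 for Class-2: all 3 of Class-1 survive,
# and only the top-10 of Class-2 are evaluated.
dets = [(1, s) for s in (0.9, 0.8, 0.7)] + [(2, s / 20) for s in range(11)]
print({c: len(v) for c, v in keep_per_class(dets, max_dets=10).items()})
# -> {1: 3, 2: 10}
```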