Yuliang-Liu / Curve-Text-Detector

This repository provides training & testing code, the dataset, detection & recognition annotations, an evaluation script, an annotation tool, and a ranking.

Calculating Precision, Recall at least differently from MLT2017 paper #39

Closed: XYudong closed this issue 5 years ago

XYudong commented 5 years ago

Hi,

Thank you so much for your repository!

In voc_eval_polygon.py, it seems that you accumulate tp and fp across all the test images and then compute precision and recall once at the end. Let me know if I have misunderstood it.

But I think, at least according to the paper, we should calculate precision and recall for each image and then take the average of these per-image values.
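For concreteness, here is a minimal sketch of the two ways of computing the metrics as I understand them. This is not the actual voc_eval_polygon.py code; the function and variable names are just for illustration, and edge cases (e.g., images with no detections) are ignored:

```python
import numpy as np

# Sketch only, not the actual voc_eval_polygon.py code.
# tp[i], fp[i], n_gt[i] are the true positives, false positives and
# ground-truth boxes counted on the i-th test image.

def prec_rec_accumulated(tp, fp, n_gt):
    """Accumulate counts over all images, then compute prec/rec once."""
    tp, fp, n_gt = np.sum(tp), np.sum(fp), np.sum(n_gt)
    return tp / (tp + fp), tp / n_gt

def prec_rec_per_image(tp, fp, n_gt):
    """Compute prec/rec per image, then average over the images."""
    tp = np.asarray(tp, dtype=float)
    fp = np.asarray(fp, dtype=float)
    n_gt = np.asarray(n_gt, dtype=float)
    return np.mean(tp / (tp + fp)), np.mean(tp / n_gt)
```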

Thanks

Yuliang-Liu commented 5 years ago

Hi,

The evaluation code follows the ICDAR evaluation metric. You can also refer to https://github.com/Yuliang-Liu/TIoU-metric/tree/master/curved-tiou, which produces exactly the same result.

It would be appreciated if you could explain why calculating the result for each image would be better; we can discuss it.

As for MLT, they do accumulate over all the test images, as I do (so the description in their MLT 2017 paper is a mistake). Here is part of the text from the official MLT email:

"The recall, precision and f-measure are NOT calculated for each image individually. They are computed based on the detected boxes in all the images (of course the boxes are matched/processed image by image). There was a confusion because in the paper of MLT-2017, there was a mistake in describing the evaluation protocol (in the paper, it is mentioned that the f-measure is computed per image and then averaged across the images -- this is not what we did)."

Thanks

XYudong commented 5 years ago

Wow! Thank you so much. Incredible! I found the correction at the bottom of their website.

Btw, about the two methods: the only advantage I came up with for the method in your code, accumulating through all test images (let's call it method 1), is that it seems less sensitive to outliers. E.g. with a = np.array([4, 6, 1, 2, 8, 18]) and b = np.array([5, 8, 6, 10, 11, 22]), if we treat the 1 and 2 in a as outliers/bad samples, then method 1 gives 0.629 and method 2 gives 0.577.

Is this the reason we don't use method 2? I still feel that the unit for testing a model is a single image, so we should calculate metrics per image. I don't know, maybe there are some statistical arguments I'm missing?
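For reference, a small snippet that reproduces the two numbers above; the meaning of a and b is my assumption (e.g., a = matched detections per image, b = ground truths per image):

```python
import numpy as np

# Toy example: per-image counts; 1 and 2 in a act as outliers/bad samples.
a = np.array([4, 6, 1, 2, 8, 18], dtype=float)
b = np.array([5, 8, 6, 10, 11, 22], dtype=float)

method1 = a.sum() / b.sum()   # accumulate over all images, then divide
method2 = (a / b).mean()      # per-image ratio, then average

print(round(method1, 3), round(method2, 3))  # 0.629 0.577
```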

Thank you again

Yuliang-Liu commented 5 years ago

Yes, outliers are one possible reason. See the figure below:

[figure]

In my opinion, method 1 is better. So far I have not found any dataset in the literature that is evaluated with method 2. This is only my personal view; if you want to dig deeper, I am sure there are more theoretical explanations in previous works.

Best regards

XYudong commented 5 years ago

Yep, pretty cool. Thank you so much for your reply; it's been an enjoyable discussion.

Yuliang-Liu commented 5 years ago

You are welcome. Thanks for your attention.