cs-chan / Total-Text-Dataset

Total-Text Dataset. It consists of 1,555 images with three different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
BSD 3-Clause "New" or "Revised" License

Confused about the evaluation parameters #13

Closed lillyPJ closed 5 years ago

lillyPJ commented 5 years ago

Hi. According to the standard DetEval evaluation protocol, tr = 0.8 and tp = 0.4 (which is also your default setting in the MATLAB Eval.m code). But you recommend tr = 0.7 and tp = 0.6 in your _EvaluationProtocol/README.md file:

> We recommend tr = 0.7 and tp = 0.6 threshold for a fairer evaluation with polygon ground-truth and detection format.

I am confused about how to set tr and tp when I want to compare my results with other methods (listed in the ranking table below):

Detection (based on DetEval evaluation protocol, unless stated)

| Method | Precision (%) | Recall (%) | F-measure (%) | Published at |
|--------|---------------|------------|---------------|--------------|
| MSR [paper] | 85.2 | 73.0 | 78.6 | arXiv:1901.02596 |
| FTSN [paper] | 84.7 | 78.0 | 81.3 | ICPR2018 |
| TextSnake [paper] | 82.7 | 74.5 | 78.4 | ECCV2018 |
| TextField [paper] | 81.2 | 79.9 | 80.6 | TIP2019 |
| CTD [paper] | 74.0 | 71.0 | 73.0 | PR2019 |
| Mask TextSpotter [paper] | 69.0 | 55.0 | 61.3 | ECCV2018 |
| TextNet [paper] | 68.2 | 59.5 | 63.5 | ACCV2018 |
| Textboxes [paper] | 62.1 | 45.5 | 52.5 | AAAI2017 |
| EAST [paper] | 50.0 | 36.2 | 42.0 | CVPR2017 |
| Baseline [paper] | 33.0 | 40.0 | 36.0 | ICDAR2017 |
| SegLink [paper] | 30.3 | 23.8 | 26.7 | CVPR2017 |
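
For reference, my understanding of how tr and tp are applied in the DetEval-style matching is roughly the following (a minimal Python sketch using shapely; the helper names are mine, and it simplifies the protocol by not enforcing one-to-one uniqueness or handling the one-to-many/many-to-one cases):

```python
# Sketch of DetEval-style matching with area-recall (tr) and
# area-precision (tp) thresholds. Helper names are mine, not from the
# official Eval.m; this version skips the one-to-many/many-to-one
# handling of the full protocol.
from shapely.geometry import Polygon

def match_pairs(gts, dets, tr=0.8, tp=0.4):
    """gts, dets: lists of shapely Polygons. Returns matched (gt, det) index pairs."""
    matches = []
    for i, g in enumerate(gts):
        for j, d in enumerate(dets):
            inter = g.intersection(d).area
            area_recall = inter / g.area     # fraction of the GT covered
            area_precision = inter / d.area  # fraction of the detection that is GT
            if area_recall >= tr and area_precision >= tp:
                matches.append((i, j))
    return matches
```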
ckchng commented 5 years ago

Hi there, we believe that most of the works in the table you referred to use the default values, tr = 0.8 and tp = 0.4, apart from FTSN, which uses the Pascal VOC IoU metric.
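
For clarity, the Pascal VOC criterion is a single intersection-over-union test rather than the separate (tr, tp) pair DetEval uses; a minimal sketch (0.5 is the standard VOC threshold):

```python
# Pascal VOC-style matching: one IoU threshold instead of DetEval's
# separate (tr, tp) pair. 0.5 is the standard VOC value.
from shapely.geometry import Polygon

def iou_match(g: Polygon, d: Polygon, iou_thresh: float = 0.5) -> bool:
    inter = g.intersection(d).area
    union = g.area + d.area - inter
    return inter / union >= iou_thresh
```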

We are currently asking the authors (in the table) to send us their detection output so we can evaluate their results with tr = 0.7 and tp = 0.6 (which we found to be better values in terms of discouraging methods that produce loose detection boxes).

FYI, we are currently updating the table with our re-evaluation. However, we can't guarantee when it will be done, since we haven't received all the authors' replies yet. Hope this helps.

lillyPJ commented 5 years ago

When I use tr = 0.8 and tp = 0.4, I find that if I expand the boundaries of my detection polygons, the score gets much better, which is not consistent with the visual quality of the detections. Can you check your code for this situation? Alternatively, I can send you two different sets of results to compare.
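
To make the effect concrete, here is a toy example (made-up coordinates, not from my actual detections; Python with shapely):

```python
# Toy illustration of why padding a detection polygon helps under
# tp = 0.4 but is punished under tp = 0.6. Coordinates are made up.
from shapely.geometry import Polygon

gt    = Polygon([(0, 0), (10, 0), (10, 2), (0, 2)])                # GT, area 20
tight = Polygon([(0.2, 0.1), (9.8, 0.1), (9.8, 1.9), (0.2, 1.9)])  # snug detection
loose = Polygon([(-1, -0.5), (11, -0.5), (11, 2.5), (-1, 2.5)])    # padded detection

for name, det in [("tight", tight), ("loose", loose)]:
    inter = gt.intersection(det).area
    print(f"{name}: area_recall={inter / gt.area:.2f}, "
          f"area_precision={inter / det.area:.2f}")

# tight: area_recall=0.86, area_precision=1.00 -> matches under both settings
# loose: area_recall=1.00, area_precision=0.56 -> passes tr = 0.8, tp = 0.4,
#        but tp = 0.6 rejects it, so padding no longer pays off
```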

lillyPJ commented 5 years ago

I have uploaded my results to https://pan.baidu.com/s/16S66fcY9cPYm2LY7s3ovlg (code = 9xku). My results are below (tested with your official MATLAB code).

1. no_expand
   - tr = 0.8, tp = 0.4: Recall = 74.113, Precision = 81.710, F-score = 77.727
   - tr = 0.7, tp = 0.6: Recall = 78.901, Precision = 85.724, F-score = 82.171
2. expand
   - tr = 0.8, tp = 0.4: Recall = 80.816, Precision = 88.434, F-score = 84.454
   - tr = 0.7, tp = 0.6: Recall = 51.578, Precision = 57.197, F-score = 54.242
ckchng commented 5 years ago

This is exactly the reason why we proposed the new threshold values; we found the same behaviour in our own experiments. The old values are too loose for our tight polygon ground-truth format, and the new thresholds are meant to discourage loose bounding-box predictions. Thank you for the valuable example and the results you shared on Baidu Cloud.

If you are concerned about the inconsistency in your comparison (i.e., different sets of thresholds used by other methods), we suggest you include both sets of results in your manuscript and explain them accordingly. We will update our comparison table soon (with 0.7 and 0.6), since these are now the official values for Total-Text.