Calamari-OCR / calamari

Line-based ATR engine based on OCRopy
Apache License 2.0

calamari-eval: confusion table miscalculates relative frequency #325

Closed bertsky closed 2 years ago

bertsky commented 2 years ago

Using Calamari 2.2.2 I sometimes get strange reports from calamari-eval:

Got mean normalized label error rate of 0.05% (222 errs, 478306 total chars, 228 sync errs)
GT       PRED     COUNT    PERCENT   
{ }      {}              3      1.32%
{.}      {}              3      1.32%
{J}      {j}             2      0.88%
{a}      {}              2      0.88%
{}       {.}             2      0.88%
{afoj}   {}              2      3.51%
{,}      {}              2      0.88%
{ot }    {}              1      1.32%
{á}      {a}             1      0.44%
{Štó smědźeše hrody twarić?} {k}             1     11.40%

Here, it seems that all the entries with more than one character have wrong figures in the PERCENT column.

(I do seem to have wrong segmentation in my data, causing such long confusions. But the point is that the absolute-to-relative conversion should not depend on the length of the alignment.)

andbue commented 2 years ago

The percentage there is calculated as count * max(len(gt), len(pred)) for the confusion in question divided by sum(max(len(gt_str), len(pred_str))) for all the confusions in the dataset. The label "PERCENT" does not really mirror this complexity, I guess... Just calculating the percentage of the count against total_count might be a bit more consistent. On the other hand, the current implementation shows how much of the total CER is actually caused by a particular confusion. Would the label PERCENT_CER be more appropriate?
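
To illustrate the difference, here is a minimal sketch (not Calamari's actual code, and using only a handful of the confusions from the report above, so the figures will not match the report, whose denominator covers all confusions in the evaluation): it contrasts the plain count share with the length-weighted share that PERCENT currently reports.

```python
# Toy confusion counts taken from the report above: {(gt, pred): count}.
confusions = {
    (" ", ""): 3,
    (".", ""): 3,
    ("J", "j"): 2,
    ("afoj", ""): 2,
    ("Štó smědźeše hrody twarić?", "k"): 1,
}

# Simple denominator: total number of confusion occurrences.
total_count = sum(confusions.values())

# Length-weighted denominator: each occurrence weighted by the length of the
# longer side of the confusion, i.e. its contribution to the label errors.
total_weighted = sum(count * max(len(gt), len(pred))
                     for (gt, pred), count in confusions.items())

for (gt, pred), count in confusions.items():
    share_of_count = count / total_count                              # plain relative count
    share_of_cer = count * max(len(gt), len(pred)) / total_weighted   # what PERCENT currently shows
    print(f"{{{gt}}} -> {{{pred}}}: {share_of_count:6.2%} of all confusions, "
          f"{share_of_cer:6.2%} of the error mass")
```

Under the first definition, {a} → {} and {afoj} → {} (both with count 2) would get the same percentage; under the second, {afoj} → {} weighs roughly four times as much, which matches the pattern in the report.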

bertsky commented 2 years ago

Ah, got it! I agree this figure is actually more relevant than the simple relative count.

> Would the label PERCENT_CER be more appropriate?

Yes, absolutely! That would make it clear immediately.

bertsky commented 2 years ago

Completed in Calamari-OCR/calamari@5a46ca1.

Many thanks!