Closed by finkf 6 years ago
After fighting with OCR-D/core#176 as well, and figuring you meant recognizing at the word level from GT-PAGE layout segmentation, I can reproduce.
It happens at word id w_w1aab1b3b4b9c11b2b3, which has a dash (—) annotated in GT. So probably the forced recognition mode (using given Word coordinates in PSM.SINGLE_WORD) just fails to recognize this as a valid single word.
Of course, that code should have asked whether there actually was a result at all. Thanks for reporting this!
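A minimal sketch of such a check, assuming the recognizer hands back a plain list of integer confidences in [0, 100]. The helper name `safe_word_conf` and the 0.0 fallback are illustrative assumptions, not the code of the actual fix:

```python
from typing import Sequence


def safe_word_conf(confidences: Sequence[int], default: float = 0.0) -> float:
    """Return the first word confidence scaled to [0, 1], or `default` if empty."""
    if not confidences:
        # Recognition produced no result for this segment (for example a lone
        # dash forced through single-word mode), so there is nothing to index.
        return default
    return confidences[0] / 100.0
```

Whether 0.0 is the right fallback is a design choice: it marks the word as maximally uncertain instead of raising on an empty list.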
This concerns both the word level and the glyph level. Confidences are still problematic at the line level, too.
Want me to take that invitation?
I have a solution (#21) for the former two. But still no idea what to do about line level confidences. MeanTextConf seems even more wrong to me there. If only one could have a look at PRImA Lab's Tesseract to Page exporter for reference. But this is a (Windows) binary blob.
Looking more into it, Tesseract seems to always calculate arithmetic averages (not geometric averages or totals/products) for confidence, too. (Its native score ranges over [-20,0], which is then converted (and clipped) to [0,100] when calling Confidence() on an iterator or AllWordConfidences() on the base API handle.) So MeanTextConf() is correct (consistent) for aggregate scores like line level confidences.
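To make the arithmetic concrete, here is a rough sketch of that conversion and averaging in plain Python. The linear mapping `100 + 5 * certainty` is an assumption about how [-20,0] ends up in [0,100], and both function names are made up for illustration. Neither is taken from the Tesseract API:

```python
from typing import Iterable


def certainty_to_confidence(certainty: float) -> float:
    """Map a native score in [-20, 0] to [0, 100], clipping out-of-range values."""
    # Assumed linear mapping: -20 -> 0 and 0 -> 100.
    return min(100.0, max(0.0, 100.0 + 5.0 * certainty))


def mean_confidence(certainties: Iterable[float]) -> float:
    """Arithmetic mean of the converted confidences; 0.0 if there are none."""
    values = [certainty_to_confidence(c) for c in certainties]
    return sum(values) / len(values) if values else 0.0
```

With this reading, a line made of two words with native scores -10 and 0 averages the converted values 50 and 100, which is the kind of aggregate MeanTextConf() would be expected to report.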
Should be fixed by https://github.com/OCR-D/ocrd_tesserocr/pull/22, right? Feel free to reopen if not.
The calculation of the word confidence fails if the returned list is empty (this happens with the Fraktur model for blumbach_anatomie_1805_0049.xml). I am not sure why the confidence list is empty and what the best way to fix this is. But maybe it is sufficient to just set the word_conf to 0.0 if the returned list is empty.
Error trace: