OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Index out of range in tessapi.AllWordConfidences() #20

Closed finkf closed 6 years ago

finkf commented 6 years ago

The calculation of the word confidence fails if the returned list is empty (this happens with the Fraktur model for blumbach_anatomie_1805_0049.xml). I am not sure why the confidence list is empty and what the best way to fix this is.

But maybe it is sufficient to just set the word_conf to 0.0 if the returned list is empty.
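A minimal sketch of that fallback, assuming the list returned by tessapi.AllWordConfidences() is passed in as a plain Python list of values in [0, 100]. The helper name first_word_confidence and its default parameter are hypothetical and not part of ocrd_tesserocr or tesserocr:

```python
def first_word_confidence(confidences, default=0.0):
    """Return the first word confidence scaled to [0, 1].

    `confidences` stands in for the result of tessapi.AllWordConfidences().
    If the list is empty (nothing was recognized), return `default`
    instead of raising IndexError.
    """
    if not confidences:
        return default
    return confidences[0] / 100.0
```

In recognize.py this would take the place of the direct `[0]` index, for example `word_conf = first_word_confidence(tessapi.AllWordConfidences())`. Whether 0.0 is the right value for "no result" remains an open question.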

Error trace:

Traceback (most recent call last):
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/ocrd-tesserocr-recognize", line 11, in <module>
    load_entry_point('ocrd-tesserocr', 'console_scripts', 'ocrd-tesserocr-recognize')()
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/ocrd_tesserocr/ocrd_tesserocr/cli.py", line 27, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/ocrd/decorators.py", line 28, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/ocrd/processor/base.py", line 63, in run_processor
    processor.process()
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 153, in process
    word_conf = tessapi.AllWordConfidences()[0]/100.0
IndexError: list index out of range
bertsky commented 6 years ago

After fighting with OCR-D/core#176 as well, and figuring you meant recognizing at the word level from GT-PAGE layout segmentation, I can reproduce.

It happens at word id w_w1aab1b3b4b9c11b2b3, which has a dash () annotated in GT. So probably the forced recognition mode (using given Word coordinates in PSM.SINGLE_WORD) just fails to recognize this as a valid single word.

Of course, that code should have asked whether there actually was a result at all. Thanks for reporting this!

This concerns both the word level and the glyph level. Confidences are still problematic at the line level, too.

Want me to take that invitation?

bertsky commented 6 years ago

I have a solution (#21) for the former two. But I still have no idea what to do about line-level confidences. MeanTextConf seems even more wrong to me there. If only one could have a look at PRImA Lab's Tesseract to Page exporter for reference. But that is only available as a (Windows) binary blob.

bertsky commented 6 years ago

Looking more into it, Tesseract seems to always calculate arithmetic averages (not geometric averages or totals/products) for confidence, too. (Its native score ranges over [-20,0], which is then converted (and clipped) to [0,100] when calling Confidence() on an iterator or AllWordConfidences() on the base API handle.) So MeanTextConf() is correct (consistent) for aggregate scores like line-level confidences.
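To illustrate the conversion and averaging described here, a rough sketch in plain Python. The linear mapping 100 + 5 * certainty is an assumption chosen to be consistent with the ranges stated above ([-20,0] to [0,100]). The function names are hypothetical and the code does not call Tesseract at all:

```python
def certainty_to_confidence(certainty):
    """Map a native score (nominally in [-20, 0]) to [0, 100], with clipping.

    Assumption: a linear mapping 100 + 5 * certainty, so that
    -20 becomes 0 and 0 becomes 100.
    """
    return min(100.0, max(0.0, 100.0 + 5.0 * certainty))


def mean_confidence(certainties):
    """Arithmetic mean of the converted per-word scores (0.0 if empty)."""
    if not certainties:
        return 0.0
    converted = [certainty_to_confidence(c) for c in certainties]
    return sum(converted) / len(converted)
```

With this reading, a line made of two words with native scores -10 and -20 would get (50 + 0) / 2 = 25, which is the kind of aggregate that MeanTextConf() reports.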

kba commented 6 years ago

Should be fixed by https://github.com/OCR-D/ocrd_tesserocr/pull/22, right? Feel free to reopen if not.