OCR-D / ocrd_tesserocr

Run Tesseract via the tesserocr bindings, with @OCR-D's interfaces
MIT License

also fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py #7

Closed bertsky closed 4 years ago

bertsky commented 6 years ago

For OCR postcorrection, TextLine.Word.Glyph.TextEquiv can be more valuable than just TextLine.TextEquiv. It allows building a lattice (or rather, a confusion network) of alternative character hypotheses from which to (re)build words and phrases. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of TextEquiv elements with index and conf (confidence) attributes. This does not help in addressing segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of Word elements), but most ambiguity on the character level can still be captured.

Example:

<TextLine id="...">
  <Coords points="..."/>
  <Word id="...">
    <Coords points="..."/>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv>
        <Unicode>a</Unicode>
      </TextEquiv>
    </Glyph>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv index="0" conf="0.6">
        <Unicode>m</Unicode>
      </TextEquiv>
      <TextEquiv index="1" conf="0.3">
        <Unicode>rn</Unicode>
      </TextEquiv>
      <TextEquiv index="2" conf="0.1">
        <Unicode>in</Unicode>
      </TextEquiv>
    </Glyph>
  </Word>
  <Word id="...">
    ...
  </Word>
  <TextEquiv>
    <Unicode>am Ende</Unicode>
  </TextEquiv>
</TextLine>
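To illustrate what such glyph variants buy us downstream: the alternatives can be expanded into ranked word hypotheses by taking the product over per-glyph readings. A minimal pure-Python sketch (the glyph lists and scores are taken from the example above; the function name is hypothetical, and multiplying confidences assumes independence between glyphs):

```python
from itertools import product

def word_hypotheses(glyphs):
    """Expand per-glyph alternatives [(text, conf), ...] into ranked word readings.

    Multiplying glyph confidences assumes independence between glyphs --
    a simplification, but the usual first step for a confusion network.
    """
    hyps = []
    for combo in product(*glyphs):
        text = "".join(t for t, _ in combo)
        conf = 1.0
        for _, c in combo:
            conf *= c
        hyps.append((text, conf))
    return sorted(hyps, key=lambda h: -h[1])

# The <Word> from the example: glyph "a" (certain) plus a glyph with 3 variants.
glyphs = [[("a", 1.0)], [("m", 0.6), ("rn", 0.3), ("in", 0.1)]]
print(word_hypotheses(glyphs))
# -> [('am', 0.6), ('arn', 0.3), ('ain', 0.1)]
```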

So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems straightforward for this use case: baseapi.h contains GetIterator(), which returns a ResultIterator that can recurse down to RIL_SYMBOL as PageIteratorLevel. For each glyph, GetUTF8Text() and Confidence() then yield what we need.
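In tesserocr terms, the iteration described above might look like the following sketch (tesserocr is imported lazily inside the function so it can be defined without a Tesseract installation; the function name and the percent-to-fraction scaling are my assumptions, the API names are tesserocr's):

```python
def glyph_hypotheses(image_path, lang="eng"):
    """Return [(text, conf), ...] for each recognized glyph, using tesserocr.

    Lazy import so the module loads without Tesseract installed -- this is
    only a sketch, not the recognize.py implementation.
    """
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        glyphs = []
        # iterate_level() advances the ResultIterator at RIL.SYMBOL steps,
        # i.e. one glyph at a time.
        for symbol in iterate_level(ri, RIL.SYMBOL):
            text = symbol.GetUTF8Text(RIL.SYMBOL)   # best reading for this glyph
            conf = symbol.Confidence(RIL.SYMBOL)    # percent confidence (0..100)
            glyphs.append((text, conf / 100.0))
        return glyphs
```

Variant readings per glyph would additionally need the ChoiceIterator discussed below, which is where the Tesseract-version caveats come in.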

kba commented 6 years ago

Thanks for your feedback, much appreciated. Short update on progress: we laid the groundwork last week to make this possible by making the PAGE library in OCR-D/core more flexible. Now we're discussing how to parameterize segmentation/recognition so the Tesseract API can be used on different levels with different langdata etc. Once that is settled, recognizing on the glyph level and keeping confidences should be straightforward to implement.

bertsky commented 6 years ago

Implemented a prototype. This does add glyph annotation with confidences, but alternative hypotheses are only available with Tesseract 3 models (while LSTM models just give the first reading).

Not sure if that change can help.

bertsky commented 6 years ago

> Not sure if that change can help.

This and the related tesseract-ocr/tesseract#1851 and tesseract-ocr/tesseract#1997 do not in fact serve the old API (LTRResultIterator::GetChoiceIterator), but merely introduce a new function (TessBaseAPI::GetBestLSTMChoices), which will become available with release 4.0. So either we adapt to that, or we wait for the old API to be fixed as well.

kba commented 6 years ago

@noahmetzger @bertsky What's the status here?

bertsky commented 6 years ago

Still the same as 3 weeks ago AFAICT. GetBestLSTMChoices is a good start (especially for independent experiments), but I still hesitate to adapt to it here: I would like to keep both old (pre-LSTM) and new models running and producing consistent results. Maybe if we could at least query the API about old vs new, then we could attempt a different backend. But I do not see any. (There is get_languages (giving the names of the models) and the OEM class, corresponding to TessOcrEngineMode enum. But nothing that says "this language is that mode".)
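For reference, the engine modes in question mirror Tesseract's TessOcrEngineMode enum (exposed as tesserocr.OEM). A pure-Python mirror of those values, to make the gap concrete: they describe how the API was *initialized*, not which mode a given traineddata model supports, which is exactly the query that is missing.

```python
from enum import IntEnum

class OEM(IntEnum):
    """Mirror of Tesseract's TessOcrEngineMode (tesserocr.OEM)."""
    TESSERACT_ONLY = 0           # legacy (pre-LSTM) engine
    LSTM_ONLY = 1                # new LSTM engine
    TESSERACT_LSTM_COMBINED = 2  # run both engines, combine results
    DEFAULT = 3                  # whatever the model/config specifies
```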

What is worse, we cannot go further as long as tesserocr does not migrate to the 4.0 codebase. (It does not currently build at all.) See here for Noah's PR, which also needs to be updated.

noahmetzger commented 5 years ago

I will update the PR this week or next, but we are currently working on another part of the project.

wrznr commented 5 years ago

@noahmetzger @bertsky What's the status here?

stweil commented 5 years ago

@noahmetzger can be reached again after Easter Monday. His main focus is currently implementing the requirements for @bertsky in Tesseract.

bertsky commented 5 years ago

Noah did make changes that populate the old iterator API on the LSTM engine, and they have already been merged. But as I argued elsewhere, this cannot produce correct scores (for anything besides the best path) and may produce illegal characters (because it does not respect the incremental character encoding of the beam).

Also, when trying lattice output, character whitelisting and user patterns/words I observed that the current beam search is too narrow anyway.

So we are currently working on two possible solutions: