Thanks for your feedback, much appreciated. A short update on progress: we laid the groundwork last week to make this possible by making the PAGE library in OCR-D/core more flexible. Now we're discussing how to parameterize segmentation/recognition so the Tesseract API can be used on different levels with different langdata etc. Once that is settled, recognizing on the glyph level and keeping confidences should be straightforward to implement.
Implemented a prototype. This does add glyph annotation with confidences, but alternative hypotheses are only available with Tesseract 3 models (while LSTM models just give the first reading).
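For illustration, the kind of query involved looks roughly like this via tesserocr (a minimal sketch along the lines of the tesserocr README example, not the actual prototype code; the image path is hypothetical, and `save_blob_choices` only takes effect with the legacy engine):

```python
from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI(lang='eng') as api:
    api.SetImageFile('line.png')  # hypothetical input image
    api.SetVariable('save_blob_choices', 'T')  # honored by the legacy engine only
    api.Recognize()
    ri = api.GetIterator()
    for symbol in iterate_level(ri, RIL.SYMBOL):
        # best reading and confidence for this glyph
        print(symbol.GetUTF8Text(RIL.SYMBOL), symbol.Confidence(RIL.SYMBOL))
        # alternative readings; with LSTM models this currently
        # yields nothing beyond the first (best) choice
        for choice in symbol.GetChoiceIterator():
            print('  alt:', choice.GetUTF8Text(), choice.Confidence())
```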
Not sure if that change can help.
This and the related tesseract-ocr/tesseract#1851 and tesseract-ocr/tesseract#1997 do not in fact service the old API (`LTRResultIterator::GetChoiceIterator`), but instead merely introduce a new function (`TessBaseAPI::GetBestLSTMChoices`), which will become available with release 4.0. So either we adapt to that, or we wait for the old API to be fixed as well.
@noahmetzger @bertsky What's the status here?
Still the same as 3 weeks ago AFAICT. `GetBestLSTMChoices` is a good start (especially for independent experiments), but I still hesitate to adapt to it here: I would like to keep both old (pre-LSTM) and new models running and producing consistent results. Maybe if we could at least query the API about old vs new models, then we could attempt a different backend. But I do not see any such call. (There is `get_languages`, giving the names of the models, and the `OEM` class, corresponding to the `TessOcrEngineMode` enum. But nothing that says "this language is that mode".)
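To illustrate what is (and is not) available, a minimal sketch (assuming a tesserocr build that supports the `oem` keyword; model names depend on the local tessdata):

```python
import tesserocr
from tesserocr import PyTessBaseAPI, OEM

# get_languages() names the installed models, but says nothing about
# whether each one is a legacy (pre-LSTM) or an LSTM model:
path, langs = tesserocr.get_languages()
print(path, langs)

# The engine mode can only be *requested* globally at initialization;
# there is no call to ask which mode(s) a given model actually supports:
with PyTessBaseAPI(lang='eng', oem=OEM.LSTM_ONLY) as api:
    pass  # nothing here tells us whether 'eng' really provides an LSTM model
```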
What is worse, we cannot go further as long as tesserocr does not migrate to the 4.0 codebase. (It does not currently build at all.) See here for Noah's PR, which also needs to be updated.
I will update the PR this or next week, but we are currently working on another part of the project.
@noahmetzger @bertsky What's the status here?
@noahmetzger can be reached again after Easter Monday. His main focus is currently implementing the requirements for @bertsky in Tesseract.
Noah did make changes that populate the old iterator API on the LSTM engine, and these have already been merged. But as I argued elsewhere, this cannot produce correct scores (for anything besides the best path) and may produce illegal characters (because it does not respect the incremental character encoding of the beam).
Also, when trying lattice output, character whitelisting and user patterns/words, I observed that the current beam search is too narrow anyway.
So we are currently working on two possible solutions:
For OCR post-correction, `TextLine.Word.Glyph.TextEquiv` can be more valuable than just `TextLine.TextEquiv`: it allows building up a lattice (or rather, a confusion network) of alternative character hypotheses from which to (re)build words and phrases. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of `TextEquiv` elements with `index` and `conf` (confidence) attributes. This does not help in addressing segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of `Word`), but most ambiguity on the character level can still be captured. Example:
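For instance, glyph variants could be annotated like this (a minimal, abbreviated PAGE-XML fragment; IDs, readings and confidence values invented, coordinates elided):

```xml
<Word id="w01">
  <Glyph id="w01_g01">
    <TextEquiv index="1" conf="0.92">
      <Unicode>e</Unicode>
    </TextEquiv>
    <TextEquiv index="2" conf="0.05">
      <Unicode>c</Unicode>
    </TextEquiv>
  </Glyph>
  <!-- further Glyph elements ... -->
  <TextEquiv conf="0.88">
    <Unicode>eleven</Unicode>
  </TextEquiv>
</Word>
```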
So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems straightforward for this use case: `baseapi.h` contains `GetIterator()`, which gives a `ResultIterator` that allows recursing across `RIL_SYMBOL` as `PageIteratorLevel`. For each glyph, `GetUTF8Text()` and `Confidence()` then yield what we need.
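A sketch of that recursion via tesserocr (image name hypothetical; note that Tesseract reports confidences in the range 0–100, while PAGE `conf` expects 0..1, hence the scaling):

```python
from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI(lang='eng') as api:
    api.SetImageFile('region.png')  # hypothetical input image
    api.Recognize()
    ri = api.GetIterator()
    words = []  # one list of (glyph text, confidence) pairs per word
    for it in iterate_level(ri, RIL.SYMBOL):
        if it.IsAtBeginningOf(RIL.WORD):
            words.append([])  # the symbol iterator crossed a word boundary
        words[-1].append((it.GetUTF8Text(RIL.SYMBOL),
                          it.Confidence(RIL.SYMBOL) / 100.0))
    # 'words' can now be serialized into PAGE Word/Glyph/TextEquiv elements
```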