Open bertsky opened 5 years ago
Algorithms should declare their "level of operation": whether they run OCR on lines or on words, and whether they merge OCR results under the assumption that all input shares the same word segmentation.
@tboenig and I had some discussions about annotating PAGE and METS with "Erfassungstiefe" (capture depth). In essence: how to succinctly express that "this PAGE file has been segmented down to the word level, with font annotations on the word level but without actual OCR", or "this METS file comes with logical, page-spanning structure information".
As for the conflicts that arise (word segmentation): no existing PAGE should be modified, so from a workflow perspective there is no need to have them aligned. You could use the (word-level) font annotations of the input PAGE to optimize your algorithm, choose the right model etc., and output a new PAGE file with a differing word segmentation or no segmentation at all. IIRC @wrznr generally prefers this over serializing differing segmentations in a single PAGE file. How such conflicts are serialized in a single PAGE file would be #72.
I would not require much in the specs about the relation between the "processing level" of input and output files, beyond documenting what is expected and what will be produced. That is, I would NOT demand that the output PAGE have the same (or any) word segmentation if the input PAGE file had one.
> Algorithms should declare their "level of operation"
Where should they do so, in their `ocrd-tool.json` perhaps? How does the workflow configuration get to know otherwise? And if they are flexible in that respect, should they provide something like a `textequiv_level` parameter?
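For illustration, such a declaration could live next to the other parameters in the tool's `ocrd-tool.json`. A sketch only — the tool name is made up, and whether `textequiv_level` becomes a convention is exactly what is being discussed here:

```json
{
  "tools": {
    "ocrd-example-recognize": {
      "executable": "ocrd-example-recognize",
      "categories": ["Text recognition and optimization"],
      "steps": ["recognition/text-recognition"],
      "description": "example recognizer with a configurable level of operation",
      "parameters": {
        "textequiv_level": {
          "type": "string",
          "enum": ["line", "word", "glyph"],
          "default": "line",
          "description": "PAGE hierarchy level to produce TextEquiv on"
        }
      }
    }
  }
}
```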
> discussions about "Erfassungstiefe" to annotate PAGE and METS with
But why even annotate that in the output? Would it not suffice if the workflow engine knew, setting all the steps' parameters consistently? (If some processor had a `textequiv_level`, the workflow engine would set it accordingly. Otherwise, the workflow configuration would make sure to provide the processor with sufficient input.)
> No existing PAGE should be modified so from a workflow perspective, there is no need to have them aligned.
I disagree. This was my major point above: if processors produce competing segmentations (i.e. different layout/structure with different content), then one cannot reliably re-align in all but a few trivial cases. How do you put together different, partially overlapping polygons? How do you combine `Word/TextStyle` and `Word/TextEquiv` at those `Word/Coords`? For pure text annotation (i.e. when both inputs' elements contain `Word/TextEquiv`), where full alignment is at least possible (by way of edit distance), serialization is the only way, and it already breaks `Word/TextStyle` and `Word/Coords`.
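To make the edit-distance argument concrete, here is a minimal sketch of re-aligning two word segmentations of the same line via a character-level alignment (using Python's `difflib`). It operates on plain text only — `Coords` and `TextStyle` are exactly what it cannot recover:

```python
from difflib import SequenceMatcher

def word_alignment(words_a, words_b):
    """Map overlapping words between two tokenizations of the same line.

    Returns sorted (i, j) pairs meaning word i of words_a shares characters
    with word j of words_b, derived from a character-level alignment.
    A sketch only: real OCR alignment must also weigh recognition errors.
    """
    def char2word(words):
        # character offset -> index of the word containing it
        mapping = []
        for i, w in enumerate(words):
            mapping += [i] * len(w)
        return mapping

    map_a, map_b = char2word(words_a), char2word(words_b)
    sm = SequenceMatcher(None, "".join(words_a), "".join(words_b),
                         autojunk=False)
    pairs = set()
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            pairs.add((map_a[block.a + k], map_b[block.b + k]))
    return sorted(pairs)

# two OCR results with differing word segmentation of the same line:
print(word_alignment(["Frei", "burg"], ["Freiburg"]))
# [(0, 0), (1, 0)]
```

This shows that text-level re-alignment is mechanical; the breakage the paragraph above describes concerns the geometry and style annotation attached to the discarded segmentation, not the text itself.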
> I would NOT demand that output PAGE must have the same or any word segmentation if the input PAGE file had any.
But aren't those really two different cases? If a processor produces annotation that is coarser in its output than in its input, that is okay from a consistency/aggregation perspective: as long as the coarser level serves some purpose at all, it can trivially be aligned with other annotations. But allowing inconsistent segmentation (competing on whatever level: `Word`, `TextLine`, or `TextRegion`) makes it impossible for consumers to re-align; all but one lower-level annotation will have to be thrown away.
If I am still correct about this re-alignment problem, then what follows is the dilemma sketched above: we can have font annotation on the `Word` scale (with possible use cases in OCR, post-correction and applications), but only at the cost of constraining OCR's own word segmentation.

There may be another way out, though: font features get annotated in a first pass on the `TextLine` level (purely as OCR input). Then OCR does word segmentation (based on its internal LM). Then font features are annotated in a second pass on the `Word` level (for post-correction and applications). Post-correction can now pick up only one OCR+font annotation. But even if it wants to align multiple annotations, if the serialization result were represented carefully (with more than #72), one might be able to re-align it with the original annotations: each `Glyph/TextEquiv/@custom` would need to contain not just the identifier of its source annotation (which OCR), but also the `Word/@id` there (which element in that source), and for whitespace glyphs additionally the left-half `Word/Coords/@points` of the following word and the right-half points of the previous word. With this scheme, I believe it should in principle be possible to put together (without guessing) a valid resulting annotation which does include `TextStyle` and `Coords`.
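A sketch of what such a serialization might look like in PAGE. All identifiers and the `custom` syntax here are assumptions for illustration, not an existing convention:

```xml
<!-- word-internal glyph: names its source OCR and the Word it came from -->
<Glyph id="g17">
  <TextEquiv index="1" custom="source:OCR-D-OCR-1 word:w05">
    <Unicode>e</Unicode>
  </TextEquiv>
</Glyph>
<!-- whitespace glyph between w05 and w06: carries the right half of the
     previous word's points and the left half of the next word's points,
     so the source Word/Coords could be reconstructed without guessing -->
<Glyph id="g18" custom="right:w05 left:w06">
  <TextEquiv index="1" custom="source:OCR-D-OCR-1">
    <Unicode> </Unicode>
  </TextEquiv>
</Glyph>
```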
How about @finkf and me providing a proof-of-concept implementation for that kind of text alignment? Would that help shift your inclination towards a controlled annotation-depth architecture?
The current specification is agnostic about which level of segmentation OCR is supposed to operate on: either

- `TextLine` layout input (for `TextLine`, `Word` or `Glyph` text output), or
- `Word` layout input (for `Word` or `Glyph` text output).

Arguably, this is a question of workflow configuration. But if formatting/font features are annotated (as an essential part of the workflow) up to the `Word` level (see #76), then OCR will need to proceed on the word level, too: to make use of the (word-level) font features in its input (improving recognition), and to be able to match the previous (word-level) annotation in its output (so font features can be re-used for post-correction or by applications). (Simply creating a new annotation with different `Word` elements would make it hard, if not impossible, to re-combine the different PAGE annotations.)

On the other hand, if OCR does not proceed on the line level, its results will probably be suboptimal: engines want to be free to choose the best word segmentation based on their own models, including dictionaries and language models. In order to still apply OCR-internal language modelling with word-level recognition, their API would need some way of pointing them to the context of previous results, or would have to become stateful.
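The statefulness requirement can be illustrated with a toy interface. Nothing here mirrors any real engine's API, and the "language model" is a stand-in; the point is only that word-by-word recognition needs the line context carried between calls:

```python
class StatefulWordRecognizer:
    """Hypothetical word-level OCR API that keeps line context so an
    engine-internal language model can still be applied."""

    def __init__(self):
        self._line_context = []

    def start_line(self):
        """Reset the context at each new TextLine."""
        self._line_context = []

    def recognize_word(self, hypotheses):
        """Pick among word hypotheses using the accumulated line context.

        `hypotheses`: candidate (text, score) pairs, standing in for the
        engine's raw character decoding of one word image.
        """
        def lm_bonus(text):
            # toy "language model": prefer lower-case continuation words
            return 0.1 if self._line_context and text.islower() else 0.0

        best = max(hypotheses, key=lambda ts: ts[1] + lm_bonus(ts[0]))
        self._line_context.append(best[0])
        return best[0]

rec = StatefulWordRecognizer()
rec.start_line()
print(rec.recognize_word([("Die", 0.9), ("Oie", 0.5)]))   # Die
print(rec.recognize_word([("Hut", 0.55), ("hut", 0.5)]))  # hut (context bonus)
```

Without the stored context, the second call would pick "Hut"; a stateless word-level API would have to pass that context in explicitly on every call.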
Furthermore, when multiple OCR results are aligned for post-correction, allowing word segmentation ambiguity can actually help (see #72). But that requires staying on the line level. If instead OCR lines are aligned on the word level (i.e. preserving any one of the `Word` segmentations, deemed the "master OCR" by CIS), then wrong segmentation will stay uncorrected. But then at least word-based font features can be used as input and kept in the output.

We already had a related discussion on the correct behaviour of the Tesseract wrapper (internal vs. external `Word` layout), but I am afraid this issue is bigger. Or am I mistaken: are there other options? Can we at least get a clearer picture of this?