Open bertsky opened 5 years ago
Algorithms should declare their "level of operation": whether they run OCR on lines or on words, and whether they merge OCR results under the assumption that all input shares the same word segmentation.
@tboenig and I had some discussions about annotating PAGE and METS with "Erfassungstiefe" (capture depth). In essence: how to succinctly express that "this PAGE file has been segmented down to the word level, with font annotations on the word level but without actual OCR", or "this METS file comes with logical, page-spanning structure information".
As for the conflicts that arise (word segmentation): no existing PAGE should be modified, so from a workflow perspective there is no need to have them aligned. You could use the (word-level) font annotations of the input PAGE to optimize your algorithm, choose the right model etc., and output a new PAGE file with a differing word segmentation or no segmentation at all. IIRC @wrznr generally prefers this over serializing differing segmentations in a single PAGE file. How such conflicts are serialized in a single PAGE file would be #72.
I would not require much in the specs about the relation between the "processing level" of input and output files, beyond documenting what is expected and what will be produced. That is, I would NOT demand that the output PAGE have the same (or any) word segmentation if the input PAGE file had one.
> Algorithms should declare their "level of operation"
Where should they do so, in their `ocrd-tool.json` perhaps? How does the workflow configuration get to know otherwise? And if they are flexible in that respect, should they provide something like a `textequiv_level` parameter?
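For illustration, such a declaration could live next to the other parameters in the tool's `ocrd-tool.json`. A sketch only — the tool name is made up, and whether `textequiv_level` becomes a convention is exactly what is being discussed here:

```json
{
  "tools": {
    "ocrd-example-recognize": {
      "executable": "ocrd-example-recognize",
      "categories": ["Text recognition and optimization"],
      "steps": ["recognition/text-recognition"],
      "description": "example recognizer with a configurable level of operation",
      "parameters": {
        "textequiv_level": {
          "type": "string",
          "enum": ["line", "word", "glyph"],
          "default": "line",
          "description": "PAGE hierarchy level to produce TextEquiv on"
        }
      }
    }
  }
}
```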
> discussions about "Erfassungstiefe" to annotate PAGE and METS with
But why even annotate that in the output? Would it not suffice if the workflow engine knew, setting all the steps' parameters consistently? (If some processor had a `textequiv_level`, the workflow engine would set it accordingly. Otherwise, the workflow configuration would make sure to provide the processor with sufficient input.)
> No existing PAGE should be modified so from a workflow perspective, there is no need to have them aligned.
I disagree. This was my major point above: if processors produce competing segmentations (i.e. different layout/structure with different content), then one cannot reliably re-align in all but a few trivial cases. How do you put together different, partially overlapping polygons? How do you combine `Word/TextStyle` and `Word/TextEquiv` at those `Word/Coords`? For pure text annotation (i.e. when both inputs' elements contain `Word/TextEquiv`), where full alignment is at least possible (by way of edit distance), serialization is the only way, and it already breaks `Word/TextStyle` and `Word/Coords`.
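To make the edit-distance argument concrete, here is a minimal sketch of re-aligning two word segmentations of the same line via a character-level alignment (using Python's `difflib`). It operates on plain text only — `Coords` and `TextStyle` are exactly what it cannot recover:

```python
from difflib import SequenceMatcher

def word_alignment(words_a, words_b):
    """Map overlapping words between two tokenizations of the same line.

    Returns sorted (i, j) pairs meaning word i of words_a shares characters
    with word j of words_b, derived from a character-level alignment.
    A sketch only: real OCR alignment must also weigh recognition errors.
    """
    def char2word(words):
        # character offset -> index of the word containing it
        mapping = []
        for i, w in enumerate(words):
            mapping += [i] * len(w)
        return mapping

    map_a, map_b = char2word(words_a), char2word(words_b)
    sm = SequenceMatcher(None, "".join(words_a), "".join(words_b),
                         autojunk=False)
    pairs = set()
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            pairs.add((map_a[block.a + k], map_b[block.b + k]))
    return sorted(pairs)

# two OCR results with differing word segmentation of the same line:
print(word_alignment(["Frei", "burg"], ["Freiburg"]))
# [(0, 0), (1, 0)]
```

This shows that text-level re-alignment is mechanical; the breakage the paragraph above describes concerns the geometry and style annotation attached to the discarded segmentation, not the text itself.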
> I would NOT demand that output PAGE must have the same or any word segmentation if the input PAGE file had any.
But aren't those really two different cases? If a processor produces annotation that is coarser in its output than in its input, that is okay from a consistency/aggregation perspective: as long as the coarser level serves some purpose at all, it can trivially be aligned with other annotations. But allowing inconsistent segmentation (competing on whatever level: `Word`, `TextLine`, or `TextRegion`) makes it impossible for consumers to re-align; all but one lower-level annotation will have to be thrown away.
If I am still correct about this re-alignment problem, then what follows is the dilemma sketched above: we can have font annotation on the `Word` scale (with possible use cases in OCR, post-correction and applications), but only at the cost of constraining OCR's own word segmentation.

There may be another way out, though: font features get annotated in a first pass on the `TextLine` level (purely as OCR input). Then OCR does word segmentation (based on its internal LM). Then font features are annotated in a second pass on the `Word` level (for post-correction and applications). Post-correction can now pick up only one OCR+font annotation. But even if it wants to align multiple annotations, if the serialization result were represented carefully (with more than #72), one might be able to re-align it with the original annotations: each `Glyph/TextEquiv/@custom` would need to contain not just the identifier of its source annotation (which OCR), but also the `Word/@id` there (which element in that source), and for whitespace glyphs additionally the left-half `Word/Coords/@points` of the following word and the right-half points of the previous word. With this scheme, I believe it should in principle be possible to put together (without guessing) a valid resulting annotation which does include `TextStyle` and `Coords`.
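A sketch of what such a serialization might look like in PAGE. All identifiers and the `custom` syntax here are assumptions for illustration, not an existing convention:

```xml
<!-- word-internal glyph: names its source OCR and the Word it came from -->
<Glyph id="g17">
  <TextEquiv index="1" custom="source:OCR-D-OCR-1 word:w05">
    <Unicode>e</Unicode>
  </TextEquiv>
</Glyph>
<!-- whitespace glyph between w05 and w06: carries the right half of the
     previous word's points and the left half of the next word's points,
     so the source Word/Coords could be reconstructed without guessing -->
<Glyph id="g18" custom="right:w05 left:w06">
  <TextEquiv index="1" custom="source:OCR-D-OCR-1">
    <Unicode> </Unicode>
  </TextEquiv>
</Glyph>
```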
How about @finkf and me providing a proof-of-concept implementation for that kind of text alignment? Would that help shift your inclination towards a controlled annotation-depth architecture?
The current specification is agnostic about which level of segmentation OCR is supposed to operate on: either

- `TextLine` layout input (for `TextLine`, `Word` or `Glyph` text output), or
- `Word` layout input (for `Word` or `Glyph` text output).

Arguably, this is a question of workflow configuration. But if formatting/font features are annotated (as an essential part of the workflow) up to the `Word` level (see #76), then OCR will need to proceed on the word level, too: to make use of the (word-level) font features in its input (improving recognition), and to be able to match the previous (word-level) annotation in its output (so font features can be re-used for post-correction or by applications). (Simply creating a new annotation with different `Word` elements would make it hard, if not impossible, to re-combine the different PAGE annotations.)

On the other hand, if OCR does not proceed on the line level, its results will probably be suboptimal: engines want to be free to choose the best word segmentation based on their own models, including dictionaries and language models. In order to still apply OCR-internal language modelling with word-level recognition, their API would need some way of pointing them to the context of previous results, or would have to become stateful.
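The statefulness requirement can be illustrated with a toy interface. Nothing here mirrors any real engine's API, and the "language model" is a stand-in; the point is only that word-by-word recognition needs the line context carried between calls:

```python
class StatefulWordRecognizer:
    """Hypothetical word-level OCR API that keeps line context so an
    engine-internal language model can still be applied."""

    def __init__(self):
        self._line_context = []

    def start_line(self):
        """Reset the context at each new TextLine."""
        self._line_context = []

    def recognize_word(self, hypotheses):
        """Pick among word hypotheses using the accumulated line context.

        `hypotheses`: candidate (text, score) pairs, standing in for the
        engine's raw character decoding of one word image.
        """
        def lm_bonus(text):
            # toy "language model": prefer lower-case continuation words
            return 0.1 if self._line_context and text.islower() else 0.0

        best = max(hypotheses, key=lambda ts: ts[1] + lm_bonus(ts[0]))
        self._line_context.append(best[0])
        return best[0]

rec = StatefulWordRecognizer()
rec.start_line()
print(rec.recognize_word([("Die", 0.9), ("Oie", 0.5)]))   # Die
print(rec.recognize_word([("Hut", 0.55), ("hut", 0.5)]))  # hut (context bonus)
```

Without the stored context, the second call would pick "Hut"; a stateless word-level API would have to pass that context in explicitly on every call.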
Furthermore, when multiple OCR results are aligned for post-correction, allowing word segmentation ambiguity can actually help (see #72). But that requires staying on the line level. If instead OCR lines are aligned on the word level (i.e. preserving any one of the `Word` segmentations, deemed the "master OCR" by CIS), then wrong segmentation will stay uncorrected. But then at least word-based font features can be used as input and kept in the output.

We already had a related discussion on the correct behaviour of the Tesseract wrapper (internal vs. external `Word` layout), but I am afraid this issue is bigger. Or am I mistaken: are there other options? Can we at least get a clearer picture of this?