OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

allow intermediate PAGE annotation for word segmentation ambiguity #72

Open bertsky opened 6 years ago

bertsky commented 6 years ago

In the OCR-D workflow, there are several steps that likely require input or output to be able to represent word segmentation ambiguity and confidence values of word boundaries (whitespace characters):

  1. Each individual OCR engine might be able to provide such information. Postcorrection could benefit from it (importing whitespace characters with confidence as but one of many alternative characters within the input graph). Tesseract LSTM models definitely have this, at least internally (see OCR-D/ocrd_tesserocr#7 which is distinct but related). PAGE output needs to incorporate those without loss.

  2. Character-based language models need whitespace characters both for input and for output, at least when going below the TextLine level: For input, because the tokenization implicit in Word annotation is not guaranteed to be a trivial whitespace join (especially at punctuation). And for output, because the LM's whitespace probabilities would have to be thrown at (the confidence attributes of) neighbouring elements otherwise.

  3. Alignment of multiple OCR results produces symbol pairings including the empty symbol (insertion/deletion) and whitespace symbol (tokenization ambiguity) for each line. Since there is no reserved character for the empty symbol within TextLine:TextEquiv:Unicode – all of Unicode is possible here, except control characters, which are forbidden in CDATA –, we cannot use it to encode such alignments. We do have a natural representation for empty symbol at Glyph:TextEquiv:Unicode, but Glyph already necessitates a strict (hierarchical) Word segmentation, which would break tokenization ambiguity again.

Thus, it seems advisable to allow a PAGE annotation as interface at those particular steps which deviates from the standard in that only one Word is annotated per TextLine, and Glyph-level TextEquiv alternatives may be empty or contain whitespace (see the example below).

The alternative would be to either not use PAGE at all there or lose important information by design.

Example for multi-OCR alignment:

<TextLine>
 <Word> <!-- only 1 Word per TextLine -->
  <Glyph> <!-- glyph segmentation ambiguity within OCR (alignment by non-uniform length) -->
   <TextEquiv index="0" conf="0.9" dataType="ocr1">m</TextEquiv>
   <TextEquiv index="1" conf="0.8" dataType="ocr1">ni</TextEquiv>
   <TextEquiv index="2" conf="0.4" dataType="ocr1">ri</TextEquiv>
   <TextEquiv index="3" conf="0.3" dataType="ocr1">rn</TextEquiv>
   <TextEquiv index="4" conf="0.7" dataType="ocr2">n</TextEquiv>
   <TextEquiv index="5" conf="0.6" dataType="ocr2">r</TextEquiv>
  </Glyph>
  <Glyph> <!-- glyph segmentation ambiguity between OCR (alignment with empty symbol) -->
   <TextEquiv index="0" conf="1.0" dataType="ocr1"></TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">i</TextEquiv>
   <TextEquiv index="2" conf="0.6" dataType="ocr2">r</TextEquiv>
  </Glyph>
  <Glyph>
   <TextEquiv index="0" conf="0.9" dataType="ocr1">y</TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">v</TextEquiv>
   <TextEquiv index="2" conf="0.6" dataType="ocr2">y</TextEquiv>
  </Glyph>
  <Glyph> <!-- word segmentation ambiguity between OCR (alignment with explicit space) -->
   <TextEquiv index="0" conf="0.8" dataType="ocr1"> </TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">,</TextEquiv>
  </Glyph>
  <Glyph>
   <TextEquiv index="0" conf="0.9" dataType="ocr1">p</TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">o</TextEquiv>
  </Glyph>
  <Glyph>
   <TextEquiv index="0" conf="0.9" dataType="ocr1">a</TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">e</TextEquiv>
  </Glyph>
  <Glyph>
   <TextEquiv index="0" conf="0.9" dataType="ocr1">y</TextEquiv>
   <TextEquiv index="1" conf="0.9" dataType="ocr2">y</TextEquiv>
  </Glyph>
 </Word> <!-- end of the line -->
</TextLine>

@kba @wrznr @finkf @lschiffer

kba commented 6 years ago

Input from different OCR engines should be different PAGE-XML files in the "standard" format, where words are annotated rather than glyphs and whitespace. I'd suggest having a top-level mechanism to define this different "profile", since the semantics change if a PAGE consumer is supposed to expect whitespace in words. Maybe a keyword in the PcGts/@custom attribute?

To represent glyph alternatives indeed requires <Glyph> and to represent those directly below <TextLine> requires dummy intermediary <Word>.

Should this format handle line segmentation alignment?

Probably all elements should have an ID, to make it possible to re-map them to the "standard" format sources or to reference them directly (in addition to referencing by @index).

As for the requirements for representation, @finkf can probably offer the most informed opinion.

kba commented 6 years ago

Could the PAGE Layout Evaluation XML format be helpful here? https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/layout-evaluation/schema/layouteval.xsd

finkf commented 6 years ago

I skimmed over it. But I did not see anything related to text. It seems to be a layout evaluation schema. What did I miss?

kba commented 6 years ago

Also just skimmed it. The mechanism for classifying errors in there could be helpful to make the alignment more explicit and less conventional etc.

You are the experts, if it works for you and is documented/reusable/standards-compliant, go for it :)

bertsky commented 6 years ago

@kba Point 3 was about representation of the output of multi-OCR alignment on the line level (not on the page level), so it's about word segmentation (not line segmentation).

@finkf When we discussed this last week, both of your ideas of how to do this with PAGE either lost information (Word level) or had to assume we can reserve some characters (codepoints) for the empty symbol (TextLine level). So I came up with this. Do you think you can change cisocrgroup/cis-ocrd-py along those lines? (Or should I give it a try?) This proposal goes beyond merely representing alignment though, cf. points 1 and 2. Remember we also want glyph alternatives from OCR (aligned or not), and still aspire to use PAGE to integrate the language model.

@kba As far as I understand it, there is no custom attribute for PcGts. There is only pcGtsId and the Metadata element. Or did you mean PcGts/Page/@custom? I am very much in favour of a formal / top-level way to mark those unconventional semantics.

After a glance at layouteval.xsd I side with @finkf in that this does not help here IMHO. It can represent segmentation errors on all levels for sure. But we want to bring together different segmentations, and irrespective of which is 'right'. Same goes for error metric vs confidence level.

cneud commented 6 years ago

Ping @splet @chris1010010 for further opinions.

chris1010010 commented 6 years ago

@bertsky @kba @cneud @finkf @splet A complex discussion :-) I'm happy to help, if I can. I agree that the Layout Evaluation schema doesn't really fit. Every object (text line, word, glyph etc.) can have multiple user-defined attributes (an extension of the 'custom' attribute so to speak). Also there is MetaDataItem now for custom metadata. A convention with one word per line sounds feasible.
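
For illustration, marking such a non-standard "profile" via MetadataItem could look roughly like this (just a sketch – the name and value strings are placeholders, not an agreed convention):

<Metadata>
 <Creator>...</Creator>
 <Created>...</Created>
 <LastChange>...</LastChange>
 <!-- placeholder name/value for a to-be-agreed profile keyword: -->
 <MetadataItem type="other" name="profile" value="word-segmentation-ambiguity"/>
</Metadata>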

wrznr commented 5 years ago

@tboenig We had a lively discussion about this one in Rostock. Personally, I do not like the one-word-per-line option for aesthetic reasons. Assigning myself as a reminder to explore other options, namely ways to represent n-fold alignments within XML.

bertsky commented 5 years ago

Some considerations which might help to swing the decision between a specialised PAGE annotation (with deviating semantics) and a new customised XML format:

pro PAGE:

  1. We already use it everywhere else. Libraries for parsing and generating can be kept, importing and exporting would be no extra effort, and switching between segmentation-ambiguous and standard workflows would be easier for the components concerned with this.
  2. We can probably make this self-contained, i.e. pass enough other elements and attributes (Coords/@points, TextStyle, TextRegion/@type etc) to be able to produce a full/rich output from it again (perhaps having to solve a puzzle of input Glyph co-ordinates for output Word co-ordinates).

contra PAGE:

  1. Although intended purely as an intermediate annotation, and perhaps even marked for its special semantics in a MetadataItem element, some of these files might "survive" and could end up being misinterpreted as regular PAGE files by an unsuspecting component/user.
  2. We can probably not make this self-contained anyway, i.e. need to combine that annotation with its constituent annotations to be able to produce rich output again (perhaps by using @id pointers).

Point 2 goes both ways and needs to be cleared up first IMO.

wrznr commented 5 years ago

@bertsky Could you please provide self-contained examples which may help us in developing a solution as discussed in Rostock (i.e. with another level of representation effectively preventing the abuse of line)?

bertsky commented 5 years ago

Sorry to get back so late, but this problem seems to be a Gordian knot of sorts. Getting good real-life example data entails having some OCR which can already give these representations, which in turn entails taking action to extend the API of tesseract, ocropy or another promising candidate, for which of course a good illustrative example (and visualisation) would be invaluable.

Before I put more effort into that, please consider the following proposal for an extension of PAGE for lattices – so that we get an adequate representation of alternative segmentations without cheating.

Representing a graph in a structure-oriented XML schema like PAGE is impossible: it can only describe a tree. So one needs a pointer-oriented schema. GraphML is a popular example. For PAGE this means we should introduce:

And this should be possible on any level of granularity, depending on use-cases. So here it goes:

  1. a new element type TextSegmentLattice which may appear (0-1 times) in a TextRegion, TextLine, or Word, and may contain sequences of (0-n times) both
  2. a new element type TextSegmentNode with an id attribute,
  3. a new element type TextSegmentArc with begin and end attributes, in addition to the attributes of the Word element type, and which must have one Coords, and may also contain a TextStyle and a sequence (0-n times) of TextEquiv

The semantics of this would be straightforward. The arcs begin and end on the respective nodes, all nodes must be connected, there must be no cycles, etc. Note that the lattice is a terminal level – when used on some hierarchy level, it replaces the normal option available on that level, which can go deeper.
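
For comparison, this is the pointer style GraphML uses – a purely schematic fragment (node ids made up): nodes carry ids, and edges reference them via source/target, exactly what the proposed begin/end attributes would mimic.

<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
 <graph id="line1" edgedefault="directed">
  <!-- schematic only: ids are made up -->
  <node id="n1"/>
  <node id="n2"/>
  <node id="n3"/>
  <edge source="n1" target="n2"/> <!-- direct arc -->
  <edge source="n1" target="n3"/> <!-- alternative path via n3 -->
  <edge source="n3" target="n2"/>
 </graph>
</graphml>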

wrznr commented 5 years ago

@bertsky Please provide a small XML snippet illustrating your proposal. @kba and I are very open to your suggestion.

bertsky commented 5 years ago

Sure! So here is what the above (artificial) example could look like:

<TextLine ...><Coords points=.../>
 <TextSegmentLattice begin="1" end="9"> <!-- instead of a sequence of Word -->
  <TextSegmentNode id="1">
  <TextSegmentNode id="2">
  <TextSegmentArc begin="1" end="2" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>m</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="3">
  <TextSegmentArc begin="1" end="3" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.75"><Unicode>n</Unicode></TextEquiv>
    <TextEquiv index="1" conf="0.65"><Unicode>r</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentArc begin="3" end="2" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>i</Unicode></TextEquiv>
    <TextEquiv index="1" conf="0.6"><Unicode>r</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="4">
  <TextSegmentArc begin="2" end="4" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
    <TextEquiv index="1" conf="0.8"><Unicode>v</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="5"> <!-- explicit space: -->
  <TextSegmentArc begin="4" end="5" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode> </Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="6">
  <TextSegmentArc begin="5" end="6" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>p</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="7">
  <TextSegmentArc begin="4" end="7" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.8"><Unicode> ,</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentArc begin="7" end="6" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>o</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="8">
  <TextSegmentArc begin="6" end="8" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>a</Unicode></TextEquiv>
    <TextEquiv index="0" conf="0.7"><Unicode>e</Unicode></TextEquiv>
  </TextSegmentArc>
  <TextSegmentNode id="9">
  <TextSegmentArc begin="8" end="9" id=... language=...><Coords points=.../>
    <TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
  </TextSegmentArc>
 </TextSegmentLattice>
 <TextEquiv><Unicode>my pay</Unicode></TextEquiv>
</TextLine>

(dot graph visualisation of the lattice – image omitted)

Allow me to elaborate a little.

By comparison, with the current purely hierarchical schema we had a tree of TextRegion → TextLine → Word → Glyph, each level carrying its own TextEquiv – with implicit white space (unless the old merge rules were to be re-activated).

Whereas with the proposed extension we would additionally get a TextSegmentLattice of TextSegmentNode and TextSegmentArc elements below TextRegion, TextLine or Word, each arc carrying its own TextEquiv – with explicit white space (in TextEquiv), and possibly empty arcs (e.g. to get a single final node).

Of course, it depends on the use case what granularity level is chosen throughout documents, or even mixed within single pages.

But just what might the use cases be for this extension? As said in the opening statement, multi-OCR alignment and post-correction would benefit a lot. (Post-correction could be automatic or interactive or a combination thereof, i.e. supervised adaptation.) They are not impossible without true lattices. But the current, tree-shaped representation – which by ReadingOrder / textLineOrder / readingDirection and XML ordering convention corresponds to a confusion network or sausage lattice – lacks important information about segmentation ambiguity. In addition, there will be applications like keyword spotting that depend even more on the full lattice (if not the complete confidence matrix of possible outputs, which could be accommodated by some TextSegmentMatrix in a similar fashion).

What all those use cases have in common is that they need a representation for processing data (as opposed to GT data): they build upon the OCR search space (not the authoritative fulltext). This is of course not entirely new to PAGE, as TextEquiv lists already allow giving OCR alternatives. But it would be more rigorous with a lattice extension.

BTW, contrary to what one might expect at first glance, this does not in fact worsen PAGE's consistency issue: as argued earlier, producing TextEquiv on multiple levels makes sense only for certain purposes:

  1. Readability by humans – "index 1 consistency" is enough here (and could be generalised to "best-path consistency").
  2. Avoid losing granularity – void, since lattices allow precisely whatever granularity is needed.
  3. One-size-fits-all annotations for consumers at various levels – does not go well with OCR alternatives anyway; rather than that, workflow configuration should guide producers to fit the consumers' granularity requirements.

I would like to add that there is a recent proposal in ALTO regarding segmentation ambiguity as well (albeit not lattice-based and only for the Glyph level).

chris1010010 commented 5 years ago

I wonder if we should add options for custom extension in PAGE (like in ALTO), where you can put any custom XML elements that are exempt from validation.

See this from the ALTO schema:

        <xsd:element name="XmlData" minOccurs="0">
            <xsd:annotation>
                <xsd:documentation xml:lang="en">
                    The xml data wrapper element XmlData is used to contain XML encoded metadata.
                    The content of an XmlData element can be in any namespace or in no namespace.
                    As permitted by the XML Schema Standard, the processContents attribute value for the
                    metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are
                    identified by means of an XML schemaLocation attribute, then an XML processor will validate
                    the elements for which it can find declarations. If a source schema is not identified, or cannot be
                    found at the specified schemaLocation, then an XML validator will check for well-formedness,
                    but otherwise skip over the elements appearing in the XmlData element.
                </xsd:documentation>
            </xsd:annotation>
            <xsd:complexType>
                <xsd:sequence>
                    <xsd:any namespace="##any" processContents="lax" maxOccurs="unbounded"/>
                </xsd:sequence>
            </xsd:complexType>
        </xsd:element>

bertsky commented 5 years ago

Thanks, @chris1010010. It is, of course, your decision, but that option would also make it impossible to enforce that the new element is a terminal alternative to line, word and glyph sequences (via xsd:choice).
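
Just to spell out what that enforcement could look like, here is a schematic fragment along those lines (abbreviated – most of the existing TextLine content model and all attributes are omitted, and this is not the literal text of my draft schema):

<xsd:complexType name="TextLineType">
 <xsd:sequence>
  <xsd:element name="Coords" type="pc:CoordsType"/>
  <!-- schematic: other existing TextLine content omitted -->
  <xsd:choice>
   <!-- either the conventional Word sequence ... -->
   <xsd:element name="Word" type="pc:WordType" minOccurs="0" maxOccurs="unbounded"/>
   <!-- ... or the lattice as terminal alternative -->
   <xsd:element name="TextSegmentLattice" type="pc:TextSegmentLatticeType" minOccurs="0"/>
  </xsd:choice>
  <xsd:element name="TextEquiv" type="pc:TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>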

I have now made a draft version of the schema for my proposal on https://github.com/bertsky/PAGE-XML. The very last changeset adds the TextSegmentLattice stuff, and the 3 runup changesets are a cleanup for general readability (which might be of value independently). I can make a PR if you like me to.

BTW, I have also tried to reduce redundancy and code duplication by refactoring the shared elements into xsd:group and shared attributes into xsd:attributeGroup – mostly those of pc:RegionType, pc:TextRegionType, pc:TextLineType, pc:Word and pc:Glyph. (One cannot use xsd:extension for this, because inheritance is already used for pc:RegionType, and multiple inheritance is not allowed in XML Schema.) But this only makes sense if the xsd:sequence requirements were to be relaxed to xsd:all, i.e. if element types are not added in a specific order (not to be confused with element ordering). The reason is that elements would have to be shared at different positions (e.g. AlternativeImage and Coords more in the front, UserDefined and Labels towards the end, TextEquiv and TextStyle in between if applicable), which would render the schema illegible instead of improving clarity. Because of these difficulties, I did not publish that part (but can share it if anyone is interested).
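
For anyone unfamiliar with that mechanism, here is a generic illustration (the group names are invented and the content is heavily simplified – this is not the unpublished draft itself):

<!-- invented names, simplified content: -->
<xsd:attributeGroup name="SegmentAttributeGroup">
 <xsd:attribute name="id" type="xsd:ID" use="required"/>
 <xsd:attribute name="custom" type="xsd:string"/>
</xsd:attributeGroup>

<xsd:group name="SegmentElementGroup">
 <xsd:sequence>
  <xsd:element name="Coords" type="pc:CoordsType"/>
  <xsd:element name="TextEquiv" type="pc:TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:group>

<!-- shared definitions are then referenced instead of duplicated: -->
<xsd:complexType name="WordType">
 <xsd:group ref="pc:SegmentElementGroup"/>
 <xsd:attributeGroup ref="pc:SegmentAttributeGroup"/>
</xsd:complexType>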

chris1010010 commented 5 years ago

Okay, thanks @bertsky. I will discuss with Stefan Pletschacher and Apostolos Antonacopoulos. We can consider it for PAGE 2019-07-15

Does anyone have more information on:

chris1010010 commented 5 years ago

Dear all, after a lengthy discussion we decided not to add the lattice structure to the official PAGE schema. We understand the usefulness, but it does not fit well with the general purpose of PAGE, which is to describe the content of a page (and not the intermediate OCR model). We also think the addition would make the (already very complex) format too complex for users and tools. Maybe a solution would be to store this information in a separate file? Sorry for that. Best regards, The PRImA Team

bertsky commented 5 years ago

@chris1010010 Dear Christian, too bad, but thanks for detailing your reasons!

Do you still want me to separate the 3 runup changesets mentioned above (without the actual lattice extension) and prepare a pull request from them? See here for a comparison including changelog.

@kba, Do you think we can still attempt an extension within OCR-D (perhaps under a different namespace, maybe by patching the schema on the fly in core)?

bertsky commented 5 years ago

@chris1010010 I now separated the lattice extension proposal from the other purely cosmetic commits and made a PR from the latter.

@kba I updated (with forced push) the lattice extension proposal, because I found a minor validation error which xmllint had not reported earlier. (Apparently, you cannot use attribute type ID more than once per element, so begin and end all have to be type string.) Do you want me to make a PR from that branch in your fork for OCR-D processing?

chris1010010 commented 5 years ago

@bertsky Thanks, I will have a look. I should have a bit more time now (post viva). Might still be a little while though (DATeCH conference...)

cneud commented 5 years ago

@bertsky fyi, within ALTO we are currently investigating CITlab's confidence matrices (or ConfMats) in view of possible lattice support. Are you perhaps familiar with ConfMat and how it relates to your proposal? Anyway, I'll make sure to include your proposal (assuming the one here is the most recent?) in the discussions.

bertsky commented 5 years ago

@cneud thanks for bringing this to attention. Yes, I am aware of CITlab's confmat approach/format. (In fact, I have linked to it above and mentioned the possibility to extend PAGE with a TextSegmentMatrix in similar fashion.) Confusion/confidence matrices also allow storing the search space of an OCR which is based on RNN+CTC.

But as argued in my proposal for Tesseract (yes, this is the most recent), I believe using a lattice instead of a matrix

  1. is more general (also applies to other OCR approaches)
  2. allows compression/pruning of the search space
  3. relieves the user of the necessity to CTC-decode that matrix to get any valid paths and scores, which in the case of Tesseract can be quite intricate.

(As for point 3, one could probably train a sequence-to-sequence decoder for that task as well, but I am not certain of it, and it still would not work in complicated cases like Tesseract.)

Please feel free to refute any of those points! I would be more than happy to get that discussion going again. (I am sorry to have let it die earlier with Gundram and Günter.) Is the discussion for ALTO public so I can catch up with it?

cneud commented 5 years ago

@bertsky Thanks for the elaboration. Initial discussions were held in the ALTO board meeting alongside DATeCH2019 last week but as soon as we have something public, I'll post the link here and let you know.