Open bertsky opened 6 years ago
Input from different OCR engines should be different PAGE-XML files in the "standard" format, where words are annotated, not glyphs and whitespace. I'd suggest having a top-level mechanism to define this different "profile", since the semantics change if a PAGE consumer is supposed to expect whitespace in words. Maybe a keyword in the `PcGts/@custom` attribute?
To represent glyph alternatives indeed requires `<Glyph>`, and to represent those directly below `<TextLine>` requires a dummy intermediary `<Word>`.
Should this format handle line segmentation alignment?
Probably all elements should have an ID, to make it possible to re-map them to the "standard" format sources, or to reference them directly in addition to by `@index`.
As for the requirements for representation, @finkf can probably offer the most informed opinion.
Could the PAGE Layout Evaluation XML format be helpful here? https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/layout-evaluation/schema/layouteval.xsd
I skimmed over it. But I did not see anything related to text. It seems to be a layout evaluation schema. What did I miss?
Also just skimmed it. The mechanism for classifying errors in there could be helpful to make the alignment more explicit and less conventional etc.
You are the experts, if it works for you and is documented/reusable/standards-compliant, go for it :)
@kba Point 3 was about representation of the output of multi-OCR alignment on the line level (not on the page level), so it's about word segmentation (not line segmentation).
@finkf When we discussed this last week, your two ideas of how to do this with PAGE were either losing information (Word level) or assuming we can reserve some characters (codepoints) for the empty symbol (TextLine level). So I came up with this. Do you think you can change cisocrgroup/cis-ocrd-py along those lines? (Or should I give it a try?) This proposal goes beyond merely representing alignment, though, cf. points 1 and 2. Remember we also want glyph alternatives from OCR (aligned or not), and still aspire to use PAGE to integrate the language model.
@kba As far as I understand it, there is no `custom` attribute for `PcGts`. There is only `pcGtsId` and the `Metadata` element. Or did you mean `PcGts/Page/@custom`? I am very much in favour of a formal / top-level way to mark those unconventional semantics.
After a glance at layouteval.xsd I side with @finkf in that this does not help here IMHO. It can represent segmentation errors on all levels for sure. But we want to bring together different segmentations, and irrespective of which is 'right'. Same goes for error metric vs confidence level.
Ping @splet @chris1010010 for further opinions.
@bertsky @kba @cneud @finkf @splet A complex discussion :-) I'm happy to help, if I can. I agree that the Layout Evaluation schema doesn't really fit. Every object (text line, word, glyph etc.) can have multiple user-defined attributes (an extension of the 'custom' attribute so to speak). Also there is MetaDataItem now for custom metadata. A convention with one word per line sounds feasible.
@tboenig We had a lively discussion about this one in Rostock. Personally, I do not like the one-word-per-line option for aesthetic reasons. Assigning myself as a reminder to explore other options namely ways to represent n-fold alignments within XML.
Some considerations which might help to swing the decision between a specialised PAGE annotation (with deviating semantics) and a new customised XML format:

pro PAGE:

- It keeps the rich annotation (`Coords/@points`, `TextStyle`, `TextRegion/@type` etc.) to be able to produce a full/rich output from it again (perhaps having to solve a puzzle of input `Glyph` co-ordinates for output `Word` co-ordinates).

contra PAGE:

- Even if marked as deviating, e.g. via the `MetadataItem` element, some of these files might "survive" and could end up being misinterpreted as regular PAGE files by an unsuspecting component/user.
- Re-mapping to the sources would have to rely on `@id` pointers.

Point 2 goes both ways and needs to be cleared up first IMO.
@bertsky Could you please provide self-contained examples which may help us in developing a solution as discussed in Rostock (i.e. with another level of representation, effectively preventing the abuse of `line`)?
Sorry to get back so late, but this problem seems to be a Gordian knot of sorts. Getting good real-life example data entails having some OCR which can already produce these representations, which in turn entails taking action to extend the API of Tesseract, ocropy or another promising candidate, for which of course a good illustrative example (and visualisation) would be invaluable.
Before I put more effort into that, please consider the following proposal for an extension of PAGE for lattices – so that we get an adequate representation of alternative segmentations without cheating.
Representing a graph in a structure-oriented XML schema like PAGE is impossible: it can only describe a tree. So one needs a pointer-oriented schema; GraphML is a popular example. For PAGE this means we should introduce new element types which may carry `Coords`, `TextEquiv`, `TextStyle` and attributes just like `TextRegion`, `TextLine`, `Word` and `Glyph`, complementing these element types. And this should be possible on any level of granularity, depending on use-cases. So here it goes:
- a new element `TextSegmentLattice`, which may appear (0-1 times) in a `TextRegion`, `TextLine`, or `Word`, and may contain sequences (0-n times) of both
  - `TextSegmentNode`, with an `id` attribute, and
  - `TextSegmentArc`, with `begin` and `end` attributes in addition to the attributes of the `Word` element type, which must have one `Coords`, and may also contain a `TextStyle` and a sequence (0-n times) of `TextEquiv`.
The semantics of this would be straightforward: the arcs begin and end on the respective nodes, all nodes must be connected, there must not be cycles, etc. Note that the lattice is a terminal level – when used on some hierarchy level, it replaces the normal option available on that level, which can go deeper.
@bertsky Please provide a small XML snippet illustrating your proposal. @kba and I are very open to your suggestion.
Sure! So here is what the above (artificial) example could look like:
<TextLine ...>
  <Coords points=.../>
  <TextSegmentLattice begin="1" end="9"> <!-- instead of a sequence of Word -->
    <TextSegmentNode id="1"/>
    <TextSegmentNode id="2"/>
    <TextSegmentArc begin="1" end="2" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>m</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="3"/>
    <TextSegmentArc begin="1" end="3" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.75"><Unicode>n</Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.65"><Unicode>r</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentArc begin="3" end="2" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>i</Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.6"><Unicode>r</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="4"/>
    <TextSegmentArc begin="2" end="4" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.8"><Unicode>v</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="5"/> <!-- explicit space: -->
    <TextSegmentArc begin="4" end="5" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode> </Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="6"/>
    <TextSegmentArc begin="5" end="6" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>p</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="7"/>
    <TextSegmentArc begin="4" end="7" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.8"><Unicode> ,</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentArc begin="7" end="6" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>o</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="8"/>
    <TextSegmentArc begin="6" end="8" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>a</Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.7"><Unicode>e</Unicode></TextEquiv>
    </TextSegmentArc>
    <TextSegmentNode id="9"/>
    <TextSegmentArc begin="8" end="9" id=... language=...>
      <Coords points=.../>
      <TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
    </TextSegmentArc>
  </TextSegmentLattice>
  <TextEquiv><Unicode>my pay</Unicode></TextEquiv>
</TextLine>
Allow me to elaborate a little.
By comparison, with the current purely hierarchical schema we had a tree of:
– with implicit white space (unless the old merge rules were to be re-activated).
Whereas with the proposed extension we would additionally get:
– with explicit white space (in TextEquiv), and possibly empty arcs (e.g. to get a single final node).
Of course, it depends on the use case what granularity level is chosen throughout documents, or even mixed within single pages.
But just what might the use cases be for this extension? As said in the opening statement, multi-OCR alignment and post-correction would benefit a lot. (Whereby post-correction could be automatic or interactive or a combination thereof, i.e. supervised adaptation.) They are not impossible without true lattices. But the current, tree-shaped representation – which by `ReadingOrder` / `textLineOrder` / `readingDirection` and XML ordering convention corresponds to a confusion network or sausage lattice – lacks important information about segmentation ambiguity. In addition, there will be applications like keyword spotting that depend even more on the full lattice (if not the complete confidence matrix of possible outputs, which could be accommodated by some `TextSegmentMatrix` in a similar fashion).
What all those use cases have in common is that they need a representation for processing data (as opposed to GT data): they build upon the OCR search space, not the authoritative fulltext. This is of course not entirely new to PAGE, as `TextEquiv` lists already allow giving OCR alternatives. But it would be more rigorous with a lattice extension.
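To make the lattice semantics concrete, here is a minimal decoding sketch in Python. The element names follow the proposal above (they are not part of official PAGE), namespaces and `Coords` are ignored for brevity, and scoring a path by the product of its arcs' top `TextEquiv` confidences is just one possible choice:

```python
# Hypothetical decoder for the proposed TextSegmentLattice extension.
# Element names follow the proposal in this thread, not official PAGE.
import xml.etree.ElementTree as ET
from collections import defaultdict

def best_path(lattice_xml: str) -> tuple[str, float]:
    """Return the highest-scoring string spanning the lattice, where a
    path's score is the product of its arcs' top TextEquiv confidences."""
    root = ET.fromstring(lattice_xml)
    begin, end = root.get("begin"), root.get("end")
    arcs = defaultdict(list)  # begin node id -> [(end node id, text, conf)]
    for arc in root.iter("TextSegmentArc"):
        # take the top-ranked alternative (lowest @index)
        equiv = min(arc.iter("TextEquiv"), key=lambda e: int(e.get("index")))
        text = equiv.findtext("Unicode") or ""
        arcs[arc.get("begin")].append(
            (arc.get("end"), text, float(equiv.get("conf"))))
    memo = {}  # memoised DFS; assumes the lattice is acyclic

    def search(node):
        if node == end:
            return "", 1.0
        if node not in memo:
            best = ("", 0.0)
            for nxt, text, conf in arcs[node]:
                suffix, score = search(nxt)
                if conf * score > best[1]:
                    best = (text + suffix, conf * score)
            memo[node] = best
        return memo[node]

    return search(begin)
```

Applied to the example line above, the top-scoring path would read "my pay", matching the line-level `TextEquiv`.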
BTW, contrary to what one might expect from a first glance, this does not in fact worsen PAGE's consistency issue: As argued earlier, producing TextEquiv on multiple levels makes sense only for certain purposes:
I would like to add that there is a recent proposal in ALTO regarding segmentation ambiguity as well (albeit not lattice-based and only for the Glyph level).
I wonder if we should add options for custom extension in PAGE (like in ALTO), where you can put any custom XML elements that are exempt from validation.
See this from the ALTO schema:
<xsd:element name="XmlData" minOccurs="0">
<xsd:annotation>
<xsd:documentation xml:lang="en">
The xml data wrapper element XmlData is used to contain XML encoded metadata.
The content of an XmlData element can be in any namespace or in no namespace.
As permitted by the XML Schema Standard, the processContents attribute value for the
metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are
identified by means of an XML schemaLocation attribute, then an XML processor will validate
the elements for which it can find declarations. If a source schema is not identified, or cannot be
found at the specified schemaLocation, then an XML validator will check for well-formedness,
but otherwise skip over the elements appearing in the XmlData element.
</xsd:documentation>
</xsd:annotation>
<xsd:complexType>
<xsd:sequence>
<xsd:any namespace="##any" processContents="lax" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
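An instance using such a wrapper could then embed arbitrary foreign-namespace markup, e.g. (a sketch only; the `lat:` namespace URI is invented for illustration):

```xml
<XmlData>
  <lat:TextSegmentLattice xmlns:lat="http://example.org/lattice" begin="1" end="2">
    <lat:TextSegmentNode id="1"/>
    <lat:TextSegmentNode id="2"/>
  </lat:TextSegmentLattice>
</XmlData>
```

With `processContents="lax"`, a validator would only validate this content if it can locate a schema for the foreign namespace; otherwise it merely checks well-formedness.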
Thanks, @chris1010010. It is, of course, your decision, but that option would also make it impossible to enforce the new element to be a terminal alternative to line, word and glyph sequences (via `xsd:choice`).
I have now made a draft version of the schema for my proposal on https://github.com/bertsky/PAGE-XML. The very last changeset adds the `TextSegmentLattice` stuff, and the 3 run-up changesets are a cleanup for general readability (which might be of value independently). I can make a PR if you'd like me to.
BTW, I have also tried to reduce redundancy and code duplication by refactoring the shared elements into `xsd:group` and shared attributes into `xsd:attributeGroup` – mostly those of `pc:RegionType`, `pc:TextRegionType`, `pc:TextLineType`, `pc:Word` and `pc:Glyph`. (One cannot use `xsd:extension` for this, because inheritance is already used for `pc:RegionType`, and multiple inheritance is not allowed in XML Schema.) But this only makes sense if the `xsd:sequence` requirements were to be relaxed to `xsd:all`, i.e. if element types are not added in a specific order (not to be confused with element ordering). The reason is that elements would have to be shared at different positions (e.g. `AlternativeImage` and `Coords` more in the front, `UserDefined` and `Labels` towards the end, `TextEquiv` and `TextStyle` in between if applicable), which would render the schema illegible instead of improving clarity. Because of these difficulties, I did not publish that part (but can share it if anyone is interested).
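For illustration, such a shared group might look like this (a sketch only; the group name and exact content model are invented here, not taken from the unpublished draft):

```xml
<xsd:group name="TextualContentGroup">
  <xsd:sequence>
    <xsd:element name="TextStyle" type="pc:TextStyleType" minOccurs="0"/>
    <xsd:element name="TextEquiv" type="pc:TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:group>

<!-- then referenced from e.g. WordType and GlyphType instead of repeating the elements: -->
<xsd:group ref="TextualContentGroup"/>
```

Note that this sketch keeps `xsd:sequence` inside the group; relaxing to `xsd:all` would run into the XSD 1.0 restriction that particles inside `xsd:all` cannot have `maxOccurs` greater than 1.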
Okay, thanks @bertsky. I will discuss with Stefan Pletschacher and Apostolos Antonacopoulos. We can consider it for PAGE 2019-07-15.
Does anyone have more information on:
Dear all,

After a lengthy discussion we decided not to add the lattice structure to the official PAGE schema. We understand the usefulness, but it does not fit well with the general purpose of PAGE, which is to describe the content of a page (and not the intermediate OCR model). We also think the addition would make the (already very complex) format too complex for users and tools. Maybe a solution would be to store this information in a separate file? Sorry for that.

Best regards,
The PRImA Team
@chris1010010 Dear Christian, too bad, but thanks for detailing your reasons!
Do you still want me to separate the 3 run-up changesets mentioned above (without the actual lattice extension) and prepare a pull request from them? See here for a comparison including changelog.
@kba, Do you think we can still attempt an extension within OCR-D (perhaps under a different namespace, maybe by patching the schema on the fly in core)?
@chris1010010 I now separated the lattice extension proposal from the other purely cosmetic commits and made a PR from the latter.
@kba I updated (with forced push) the lattice extension proposal, because I found a minor validation error which xmllint had not reported earlier. (Apparently, you cannot use attribute type `ID` more than once per element, so `begin` and `end` both have to be of type `string`.) Do you want me to make a PR from that branch in your fork for OCR-D processing?
@bertsky Thanks, I will have a look. I should have a bit more time now (post viva). Might still be a little while though (DATeCH conference...)
@bertsky FYI, within ALTO we are currently investigating CITlab's confidence matrices (or `ConfMat`s) in view of possible lattice support. Are you perhaps familiar with `ConfMat` and how it relates to your proposal? Anyway, I'll make sure to include your proposal (assuming here is the most recent?) in the discussions.
@cneud thanks for bringing this to attention. Yes, I am aware of CITlab's confmat approach/format. (In fact, I have linked to it above and mentioned the possibility to extend PAGE with a `TextSegmentMatrix` in similar fashion.) Confusion/confidence matrices also allow storing the search space of an OCR which is based on RNN+CTC.
But as argued in my proposal for Tesseract (yes, this is the most recent), I believe using a lattice instead of a matrix
(As for point 3, one could probably train a sequence-to-sequence decoder for that task as well, but I am not certain of it, and it still would not work in complicated cases like Tesseract.)
Please feel free to refute any of those points! I would be more than happy to get that discussion going again. (I am sorry to have let it die earlier with Gundram and Günter.) Is the discussion for ALTO public so I can catch up with it?
@bertsky Thanks for the elaboration. Initial discussions were held in the ALTO board meeting alongside DATeCH2019 last week but as soon as we have something public, I'll post the link here and let you know.
In the OCR-D workflow, there are several steps that likely require input or output to be able to represent word segmentation ambiguity and confidence values of word boundaries (whitespace characters):

1. Each individual OCR engine might be able to provide such information. Post-correction could benefit from it (importing whitespace characters with confidence as but one of many alternative characters within the input graph). Tesseract LSTM models definitely have this, at least internally (see OCR-D/ocrd_tesserocr#7, which is distinct but related). PAGE output needs to incorporate those without loss.
2. Character-based language models need whitespace characters both for input and for output, at least when going below the `TextLine` level: for input, because the tokenization implicit in `Word` annotation is not guaranteed to be a trivial whitespace join (especially at punctuation); and for output, because the LM's whitespace probabilities would otherwise have to be thrown at (the confidence attributes of) neighbouring elements.
3. Alignment of multiple OCR results produces symbol pairings including the empty symbol (insertion/deletion) and the whitespace symbol (tokenization ambiguity) for each line. Since there is no reserved character for the empty symbol within `TextLine:TextEquiv:Unicode` – all of Unicode is possible here, except control characters, which are forbidden in CDATA – we cannot use it to encode such alignments. We do have a natural representation for the empty symbol at `Glyph:TextEquiv:Unicode`, but `Glyph` already necessitates a strict (hierarchical) `Word` segmentation, which would break tokenization ambiguity again.

Thus, it seems advisable to allow a PAGE annotation as interface at those particular steps which deviates from the standard in that:

- word segmentation is not annotated via `WordType` (by using only 1 `Word` per line by convention), and
- whitespace is annotated via `GlyphType` (by using whitespace within its `TextEquiv`).

The alternative would be to either not use PAGE at all there, or lose important information by design.
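The alignment of multiple OCR results mentioned above can be sketched, for the pairwise case, with a plain Levenshtein backtrace (a minimal sketch only; actual multi-OCR alignment, e.g. in cisocrgroup/cis-ocrd-py, is more involved). The empty symbol shows up here as an empty Python string, which has no reserved counterpart in `TextLine:TextEquiv:Unicode`:

```python
# Minimal pairwise character alignment of two OCR results for one line,
# illustrating how the empty symbol arises (insertions/deletions).
def align(a: str, b: str):
    """Levenshtein alignment; returns a list of (char_a, char_b) pairs,
    with '' as the empty symbol for insertions/deletions."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    # backtrace from the bottom-right corner
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((a[i - 1], '')); i -= 1
        else:
            pairs.append(('', b[j - 1])); j -= 1
    return pairs[::-1]
```

For example, aligning "pay" with "pa y" yields the pair `('', ' ')` at the tokenization difference, i.e. an empty symbol paired with a whitespace symbol.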
Example for multi-OCR alignment:
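A minimal sketch of what such an annotation could look like under the conventions above (IDs and confidence values invented for illustration; coordinates omitted):

```xml
<TextLine id="line_1">
  <Word id="line_1_word"> <!-- single dummy Word spanning the whole line -->
    <Glyph id="line_1_g1">
      <TextEquiv index="0" conf="0.9"><Unicode>a</Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.4"><Unicode>o</Unicode></TextEquiv>
    </Glyph>
    <Glyph id="line_1_g2"> <!-- tokenization ambiguity: space vs. empty symbol -->
      <TextEquiv index="0" conf="0.6"><Unicode> </Unicode></TextEquiv>
      <TextEquiv index="1" conf="0.4"><Unicode></Unicode></TextEquiv>
    </Glyph>
    <Glyph id="line_1_g3">
      <TextEquiv index="0" conf="0.8"><Unicode>b</Unicode></TextEquiv>
    </Glyph>
  </Word>
</TextLine>
```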
@kba @wrznr @finkf @lschiffer