altoxml / schema

ALTO XML schema - latest and all former versions

Direction, orientation, and reading order (text direction elements) #74

Closed mittagessen closed 2 years ago

mittagessen commented 3 years ago

This pull request bundles multiple backward-compatible changes to the schema that resolve issues related to line orientation, direction, and reading order. While I would usually split them into separate PRs, they've been discussed jointly in the past (see https://github.com/altoxml/schema/issues/12#issuecomment-113184844) and the addressed deficiencies are somewhat complementary.

Principal inline text direction

The first part of the proposal takes up #12 and #73. It adds an attribute BASEDIRECTION on the *Block and TextLine elements which indicates the base text direction of the lines/text contained therein (ltr|rtl|ttb|btt). This is helpful not only for rendering the many East Asian scripts that can be written both vertically and horizontally but also for correctly setting the base text direction of the BiDi algorithm during processing.

Example of the use of this new attribute:

...
<TextBlock ID="block_0" ... BASEDIRECTION="ltr">
<TextLine ID="line_0" ....>....</TextLine>
<TextLine ID="line_1"....>...</TextLine>
<TextLine ID="line_2" BASEDIRECTION="rtl">...</TextLine>
</TextBlock>

Different settings on lower levels of the hierarchy override those inherited from higher levels.
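
The override rule can be sketched in a few lines of Python. This is only an illustration of the inheritance described above, not part of the schema: the sample fragment is a hypothetical, namespace-free version of the example, and the function name effective_directions is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the example above (no namespace, no geometry).
SAMPLE = """
<TextBlock ID="block_0" BASEDIRECTION="ltr">
  <TextLine ID="line_0"/>
  <TextLine ID="line_1"/>
  <TextLine ID="line_2" BASEDIRECTION="rtl"/>
</TextBlock>
"""

def effective_directions(element, inherited=None):
    """Map each element ID to its effective base direction.

    A BASEDIRECTION set on an element overrides the inherited value;
    elements without the attribute keep what their ancestors set.
    """
    current = element.get("BASEDIRECTION", inherited)
    result = {}
    if element.get("ID") is not None:
        result[element.get("ID")] = current
    for child in element:
        result.update(effective_directions(child, current))
    return result

directions = effective_directions(ET.fromstring(SAMPLE))
```

Here line_0 and line_1 pick up ltr from the block, while line_2 keeps its own rtl.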

Reading Order

This part is a fairly faithful adaptation of the example in #18 which in turn derives from PageXML. Some changes are made to allow the encoding of more complex historical documents and the serialization of multiple reading orders. The two principal changes are:

In addition, elements in the reading order can be tagged with TAGREFS to indicate the role of a particular element in a reading order (such as an addition, correction or a particle that is the continuation of text on another line).

A single reading order with roles:

...
<RoleTag ID="type_0" LABEL="correction"/>
</Tags>
<ReadingOrder>
   <OrderedGroup ID="main_0">
        <ElementRef ID="o_0" REF="block_0"/>
        <ElementRef ID="o_1" REF="block_10"/>
        <ElementRef ID="o_2" REF="string_25"/>
        <ElementRef ID="o_3" REF="line_10" TAGREFS="type_0"/>
        <ElementRef ID="o_4" REF="string_26"/>
        <ElementRef ID="o_5" REF="block_2"/>
   </OrderedGroup>
</ReadingOrder>
<Layout>
...
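
A consumer could read such a group roughly as follows. This is a minimal sketch under several assumptions: the fragment is a shortened, namespace-free stand-in for a real file, the alto wrapper element and the function name read_ordered_group are only there for illustration, and TAGREFS is treated as a whitespace-separated list of tag IDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment modelled on the example above.
SAMPLE = """
<alto>
  <Tags>
    <RoleTag ID="type_0" LABEL="correction"/>
  </Tags>
  <ReadingOrder>
    <OrderedGroup ID="main_0">
      <ElementRef ID="o_0" REF="block_0"/>
      <ElementRef ID="o_1" REF="string_25"/>
      <ElementRef ID="o_2" REF="line_10" TAGREFS="type_0"/>
      <ElementRef ID="o_3" REF="string_26"/>
    </OrderedGroup>
  </ReadingOrder>
</alto>
"""

def read_ordered_group(root, group_id):
    """Return (REF, [role labels]) pairs in document order for one OrderedGroup."""
    # Resolve tag IDs to their human-readable labels once.
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.find("Tags")}
    group = root.find(f".//OrderedGroup[@ID='{group_id}']")
    sequence = []
    for ref in group.findall("ElementRef"):
        roles = [labels[tag_id] for tag_id in ref.get("TAGREFS", "").split()]
        sequence.append((ref.get("REF"), roles))
    return sequence

sequence = read_ordered_group(ET.fromstring(SAMPLE), "main_0")
```

The order comes purely from the position of each ElementRef inside the group; the role labels are looked up through TAGREFS.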

Multiple reading orders can be encoded through the nesting of unordered and ordered groups:

...
<OtherTag ID="type_1" LABEL="A valid, complete reading order"/>
</Tags>
<ReadingOrder>
   <UnorderedGroup ID="valid_orders">
       <OrderedGroup ID="order_0" TAGREFS="type_1">
           <ElementRef .../>
           <UnorderedGroup ...>
           ....
           </UnorderedGroup>
           <ElementRef .../>
       </OrderedGroup>
       <OrderedGroup ID="order_1" TAGREFS="type_1">
       ...
       </OrderedGroup>
       ....
   </UnorderedGroup>
</ReadingOrder>
...

While this can potentially result in ambiguity in the absence of a taxonomy of well-defined roles of elements in the reading order, I'd like to avoid specifying this until a later point in time, as it requires substantial input from users of the standard, especially those with more esoteric material. As can be seen in the last example, groups can also be arbitrarily nested.
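
To make the nesting concrete, here is a rough sketch of how a reader might turn nested groups into plain Python data. The fragment, its IDs and the function name to_nested are all hypothetical, and reading each OrderedGroup under the top-level UnorderedGroup as one alternative reading order is the interpretation suggested by the example, not something the schema enforces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment with two alternative orders.
SAMPLE = """
<ReadingOrder>
  <UnorderedGroup ID="valid_orders">
    <OrderedGroup ID="order_0">
      <ElementRef ID="a_0" REF="block_0"/>
      <UnorderedGroup ID="notes_0">
        <ElementRef ID="a_1" REF="block_3"/>
        <ElementRef ID="a_2" REF="block_4"/>
      </UnorderedGroup>
      <ElementRef ID="a_3" REF="block_1"/>
    </OrderedGroup>
    <OrderedGroup ID="order_1">
      <ElementRef ID="b_0" REF="block_1"/>
      <ElementRef ID="b_1" REF="block_0"/>
    </OrderedGroup>
  </UnorderedGroup>
</ReadingOrder>
"""

def to_nested(node):
    """Convert a group into nested tuples; an ElementRef becomes its REF string."""
    if node.tag == "ElementRef":
        return node.get("REF")
    kind = "ordered" if node.tag == "OrderedGroup" else "unordered"
    # Recursion handles arbitrarily deep nesting of groups.
    return (kind, [to_nested(child) for child in node])

top = ET.fromstring(SAMPLE).find("UnorderedGroup")
# Each child of the top-level unordered group is treated as one alternative.
alternatives = {group.get("ID"): to_nested(group) for group in top}
```

Keeping the ordered/unordered marker in the result leaves it to the caller to decide how members of an unordered group should be presented.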


EDIT: Removed explicit indices as attributes and the corresponding indexed types.

cneud commented 3 years ago

Awesome contribution, many thanks! Would you perhaps be interested to present and discuss these at the next ALTO board meeting?

mittagessen commented 3 years ago

Sure. I assume it's sometime after the summer holidays? We can probably also find some examples of the more unusual pages we'd like to be able to encode.

cneud commented 3 years ago

I assume it's sometime after the summer holidays?

I don't think we have fixed the date yet but it will probably be 1st or 2nd week of September, will let you know!

Example pages are certainly very welcome too.

artunit commented 3 years ago

This looks fantastic. We have a general ALTO Board meeting this week but this seems worthy of a single-topic gathering. Maybe the 2nd week of September? We tend to gravitate towards Thursdays, so tentatively 2021-09-09 (9-10:30 am EST) but we can be flexible on this. Some examples of unusual pages would be great as well!

bertsky commented 3 years ago

Interesting. This introduces PAGE-XML concepts in a radical way (along with their semantic problems). It would be great to have that kind of flexibility in ALTO (multiple RO, labels, independence of semantic and element ordering) IMO. Just a few comments/questions:

mittagessen commented 3 years ago

I'll try to answer one by one.

bertsky commented 3 years ago

Thanks for elaborating, just a few follow-ups:

  • Why reference elements below block level at all in RO? Since they also get an ordering attribute of their own here (becoming independent of element ordering), would that not be redundant? That is: what if @BASEDIRECTION and ReadingOrder (and element ordering) clash?
  • Ordering below block level is somewhat crucial for many complex texts that have elements which cannot reasonably belong to the same topological text block 'inserted' into the reading order. There are marginal insertions, notes, apparatus criticus, etc. which are located outside of the current text block but are read between elements inside the block.

I see. Indeed, for that purpose PAGE-XML's "flat" ReadingOrder + @textLineOrder is not enough, you do need a general "onto" mapping. (On the other hand, nothing syntactically prevents you from using TextLine/@ID or Word/@ID for @regionRef in PAGE-XML already – they are mere xs:IDREF, only documentation currently says they are meant for regions alone.)

But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)

@BASEDIRECTION and ReadingOrder are completely independent, in fact I believe the purpose of @BASEDIRECTION (and readingDirection in PageXML but the docs are rather mute on this point) is/should solely be to indicate the base direction parameter of the Unicode BiDi algorithm and potentially rotation for display purposes (rotating into the horizontal/vertical depending on ltr/rtl and ttb/btt). Somewhat related to this is Clarify implicit reading order #68; by enforcing the (implicit) order of elements below a TextLine to be the logical order (in the sense of the BiDi algorithm) we can still extract the text in correct order for simple visual display or computation while at the same time inserting non-line elements into the reading order for more advanced viewers. This document makes the different purposes of these ordering elements clearer.

Sorry, I misunderstood InlineDirType to denote something like @textLineOrder on the line block level. But your documentation already states the two levels are merely for inheritance. (Also, I had not given display/digital rendering much thought.)

  • Since in ALTO all structural elements below the block level have merely optional @ID: Wouldn't it be preferable to start requiring them everywhere?
  • We could make them mandatory but this would break forward compatibility of existing documents. While schema versioning should in theory prevent this, in practice people write ad-hoc parsers, so I'm a bit wary of introducing changes like this.

Agreed.

(Or, conversely, why strictly require group elements to have an @ID of their own?)

  • I never understood why PAGE-XML decided to need @index for ordered groups, and cast the ordered/unordered distinction into a 2x3 matrix of un/ordered.../indexed types. Why not use the element ordering of the refs itself for ordered groups, and represent the difference between ordered and unordered by a simple boolean @ordered?
  • That's a personal preference. The indices are admittedly only in there to not deviate too much from Page.

The deviation would merely be syntactical though. (And the syntactic candy here does weigh heavy.) The actual semantic deviation is regarding sub-region refs (but see above).

BTW, PRImA's own implementation so far does not even respect the indices (but uses implicit ordering solely).

  • Another issue I have is the relationship of this new RO mechanism to the existing @IDNEXT mechanism, which are now redundant to some degree: What if they both are used, and what if they clash?
  • The easiest way would be to disallow one when the other is present. I'm not proficient enough with XSD to know how one would encode this.

It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…
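
On the consumer side, such a documented precedence rule might look like the sketch below. It is only one possible reading of the suggestion: the fragment and the function names are hypothetical, nested groups are flattened in document order for simplicity, and the fallback simply follows IDNEXT links between TextBlock elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment where the two mechanisms disagree.
SAMPLE = """
<alto>
  <ReadingOrder>
    <OrderedGroup ID="g_0">
      <ElementRef ID="r_0" REF="b2"/>
      <ElementRef ID="r_1" REF="b1"/>
    </OrderedGroup>
  </ReadingOrder>
  <Layout>
    <TextBlock ID="b1" IDNEXT="b2"/>
    <TextBlock ID="b2"/>
  </Layout>
</alto>
"""

def follow_idnext(root):
    """Chain TextBlock IDs by following IDNEXT, starting at blocks nothing points to."""
    blocks = {block.get("ID"): block for block in root.iter("TextBlock")}
    pointed_at = {block.get("IDNEXT") for block in blocks.values()}
    sequence = []
    for start in blocks:
        if start in pointed_at:
            continue
        current = start
        # Stop at the end of the chain, at a dangling reference, or on a cycle.
        while current in blocks and current not in sequence:
            sequence.append(current)
            current = blocks[current].get("IDNEXT")
    return sequence

def block_sequence(root):
    """Use ReadingOrder when present and ignore IDNEXT; otherwise fall back to IDNEXT."""
    reading_order = root.find("ReadingOrder")
    if reading_order is not None:
        return [ref.get("REF") for ref in reading_order.iter("ElementRef")]
    return follow_idnext(root)

root = ET.fromstring(SAMPLE)
with_reading_order = block_sequence(root)
root.remove(root.find("ReadingOrder"))
idnext_only = block_sequence(root)
```

With ReadingOrder present the clash never surfaces, because the IDNEXT links are not consulted at all.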

  • Regarding @BASEDIRECTION, am I right in assuming it would practically only make sense to have block and line level use orthogonal values, i.e. rtl/ltr in a ttb/btt or ttb/btt in a rtl/ltr? How is this enforced by the schema?
  • It should be allowed in any combination which hopefully makes sense given the BiDi comment above. In any case, I'm loath to prohibit redundant encodings. They are often easier to serialize/deserialize than more compact encodings while not allowing these doesn't offer any benefits.

Yes. (The question followed from my misunderstanding. I'm not worried about the cost of redundancy here. And functionally, in a DOM you can always fully expand the inheritance.)

  • Also, since these are absolute notions, what is the relationship to @ROTATION (which could apply to each block differently)? Should we read this as applying before or after deskewing? What is your point of reference for absolute terms like top and bottom, left and right when you have non-orthogonal @ROTATION – does the interpretation of "left" snap from one side to the other as the angle crosses 45°?
  • Top to bottom, bottom to top, left to right, and right to left are line-relative and abstract notions and not absolute with regard to page orientation. Rotation is mostly independent of that. Every single line of this manuscript page would be rtl (or ltr if the text transcription was produced in an environment with the BiDi algorithm base direction set to ltr) but the @ROTATION could be anything. @ROTATION only comes into play when deciding how to extract the line image for visual display (ttb/btt lines should be rectified/rotated to be vertical, ltr/rtl lines to be horizontal) but it isn't well enough specified to be useful for that as it isn't clear relative to which axis the rotational angle is. In any case, @BASELINE is much more powerful as it allows rectifying arbitrarily curved and rotated lines; at least for manuscripts where line angle and curvature tends to change inside a text block @ROTATION is woefully inadequate.

Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)

Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).

Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.

mittagessen commented 3 years ago

But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)

Yes you could decompose it like this but you're losing some of the semantics of TextLine or lower level elements.

It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…

Yeah, I'm not sure how to do this well. AFAIK there's no good document introducing the standard and the schema comments are a bit lacking a lot of the time. We should probably get around to writing down the semantics of most constructs a bit more explicitly.

Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)

Almost certainly. It doesn't really make sense otherwise.

Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).

I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?

Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.

Of course. You actually need both to rotate a line correctly into the plane as the polygonal boundary can be deceiving when curvature and messy or differently sized letters come in combination.

mittagessen commented 3 years ago

@cneud @artunit Can we get this discussed at the next board meeting? I've missed the one in September but can definitely prepare something for the next one.

bertsky commented 3 years ago

I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?

I would argue for the latter, because 90/270/left/right is different from vertical writing. So @ROTATION and @orientation are catch-alls for skew and 90° multiples, while the other attributes are truly ordering relations. (The fact that vertical script is trained horizontally and thus glyphs are not upwards when they enter the OCR engine should not be relevant here.)

artunit commented 3 years ago

I have been terribly disconnected lately but am happy to try to align this discussion with a Board meeting. @mittagessen, @bertsky, @cneud - would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting date/time for you? @cipriandinu - would that work for you as well? We could consider an earlier date if it works for @cipriandinu, I just can't guarantee a network connection until that point.

mittagessen commented 3 years ago

Unfortunately I'm teaching during those exact hours. Otherwise my November is still free though, so any other date (or even on 18/11 in the afternoon) would work.

cipriandinu commented 3 years ago

Hi Art,

The 18th of November is OK for me, or earlier.

Best, Cip


artunit commented 3 years ago

Apologies, I will follow up by email instead of overloading the issue thread.

bertsky commented 3 years ago

Sry, did not see this earlier: I wonder if OrderedGroupType and UnorderedGroupType should also get a @REF (as they do in PAGE-XML). Without this, you'd need to add one additional ElementRefType into each group – but you would need to construct the order hierarchy differently than in PAGE-XML (i.e. graphs would have to be transformed when converting).

Also, IMHO the formulation for @BASEDIRECTION ("Indicates the inline base direction") and InlineDirType ("Describes the base direction of text inside a line or of all lines inside a text block.") can still be improved.

mittagessen commented 2 years ago

I've created some examples on how to use these extensions: alto_ro_examples.