Closed mittagessen closed 2 years ago
Awesome contribution, many thanks! Would you perhaps be interested to present and discuss these at the next ALTO board meeting?
Sure. I assume it's sometime after the summer holidays? We can probably also find some examples of the more unusual pages we'd like to be able to encode.
I assume it's sometime after the summer holidays?
I don't think we have fixed the date yet but it will probably be 1st or 2nd week of September, will let you know!
Example pages are certainly very welcome too.
This looks fantastic. We have a general ALTO Board meeting this week but this seems worthy of a single-topic gathering. Maybe the 2nd week of September? We tend to gravitate towards Thursdays, so tentatively 2021-09-09 (9-10:30 am EST) but we can be flexible on this. Some examples of unusual pages would be great as well!
Interesting. This introduces PAGE-XML concepts in a radical way (along with their semantic problems). It would be great to have that kind of flexibility in ALTO (multiple RO, labels, independence of semantic and element ordering) IMO. Just a few comments/questions:
- Why reference elements below block level at all in RO? Since they also get an ordering attribute of their own here (becoming independent of element ordering), would that not be redundant? That is: what if @BASEDIRECTION and ReadingOrder (and element ordering) clash?
- Since in ALTO all structural elements below the block level have merely optional @ID: Wouldn't it be preferable to start requiring them everywhere? (Or, conversely, why strictly require group elements to have an @ID of their own?)
- I never understood why PAGE-XML decided to need @index for ordered groups, and cast the ordered/unordered distinction into a 2x3 matrix of un/ordered.../indexed types. Why not use the element ordering of the refs itself for ordered groups, and represent the difference between ordered and unordered by a simple boolean @ordered?
- Another issue I have is the relationship of this new RO mechanism to the existing @IDNEXT mechanism, which are now redundant to some degree: What if they both are used, and what if they clash?
- Regarding @BASEDIRECTION, am I right in assuming it would practically only make sense to have block and line level use orthogonal values, i.e. rtl/ltr in a ttb/btt or ttb/btt in a rtl/ltr? How is this enforced by the schema?
- Also, since these are absolute notions, what is the relationship to @ROTATION (which could apply to each block differently)? Should we read this as applying before or after deskewing? What is your point of reference for absolute terms like top and bottom, left and right when you have non-orthogonal @ROTATION: does the interpretation of "left" snap from one side to the other as the angle crosses 45°?
I'll try to answer one by one.
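- Ordering below block level is somewhat crucial for many complex texts that have elements which cannot reasonably belong to the same topological text block 'inserted' into the reading order. There are marginal insertions, notes, apparatus criticus, etc. which are located outside of the current text block but are read between elements inside the block.
  @BASEDIRECTION and ReadingOrder are completely independent. In fact, I believe the purpose of @BASEDIRECTION (and readingDirection in PageXML, but the docs are rather mute on this point) is/should solely be to indicate the base direction parameter of the Unicode BiDi algorithm and potentially rotation for display purposes (rotating into the horizontal/vertical depending on ltr/rtl and ttb/btt). Somewhat related to this is #68; by enforcing the (implicit) order of elements below a TextLine to be the logical order (in the sense of the BiDi algorithm) we can still extract the text in correct order for simple visual display or computation while at the same time inserting non-line elements into the reading order for more advanced viewers. This document makes the different purposes of these ordering elements clearer.
- We could make them mandatory but this would break forward compatibility of existing documents. While schema versioning in theory should prevent this, in practice people write ad-hoc parsers, so I'm a bit wary of introducing changes like this.
- That's a personal preference. The indices are admittedly only in there to not deviate too much from Page.
- The easiest way would be to disallow one when the other is present. I'm not proficient enough with XSD to know how one would encode this.
- It should be allowed in any combination, which hopefully makes sense given the BiDi comment above. In any case, I'm loath to prohibit redundant encodings. They are often easier to serialize/deserialize than more compact encodings, while not allowing these doesn't offer any benefits.
- Top to bottom, bottom to top, left to right, and right to left are line-relative and abstract notions and not absolute with regard to page orientation. Rotation is mostly independent of that. Every single line of this manuscript page would be rtl (or ltr if the text transcription was produced in an environment with the BiDi algorithm base direction set to ltr) but the @ROTATION could be anything.
  @ROTATION only comes into play when deciding how to extract the line image for visual display (ttb/btt lines should be rectified/rotated to be vertical, ltr/rtl lines to be horizontal) but it isn't well enough specified to be useful for that as it isn't clear relative to which axis the rotational angle is. In any case, @BASELINE is much more powerful as it allows rectifying arbitrarily curved and rotated lines; at least for manuscripts where line angle and curvature tend to change inside a text block, @ROTATION is woefully inadequate.
Thanks for elaborating, just a few follow-ups: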
- Why reference elements below block level at all in RO? Since they also get an ordering attribute of their own here (becoming independent of element ordering), would that not be redundant? That is: what if @BASEDIRECTION and ReadingOrder (and element ordering) clash?
- Ordering below block level is somewhat crucial for many complex texts that have elements which cannot reasonably belong to the same topological text block 'inserted' into the reading order. There are marginal insertions, notes, apparatus criticus, etc. which are located outside of the current text block but are read between elements inside the block.
I see. Indeed, for that purpose PAGE-XML's "flat" ReadingOrder + @textLineOrder is not enough, you do need a general "onto" mapping. (On the other hand, nothing syntactically prevents you from using TextLine/@ID or Word/@ID for @regionRef in PAGE-XML already: they are mere xs:IDREF, only documentation currently says they are meant for regions alone.)
But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)
@BASEDIRECTION and ReadingOrder are completely independent, in fact I believe the purpose of @BASEDIRECTION (and readingDirection in PageXML but the docs are rather mute on this point) is/should solely be to indicate the base direction parameter of the Unicode BiDi algorithm and potentially rotation for display purposes (rotating into the horizontal/vertical depending on ltr/rtl and ttb/btt). Somewhat related to this is Clarify implicit reading order #68; by enforcing the (implicit) order of elements below a TextLine to be the logical order (in the sense of the BiDi algorithm) we can still extract the text in correct order for simple visual display or computation while at the same time inserting non-line elements into the reading order for more advanced viewers. This document makes the different purposes of these ordering elements clearer.
Sorry, I misunderstood InlineDirType to denote something like @textLineOrder on the line block level. But your documentation already states the two levels are merely for inheritance. (Also, I had not given display/digital rendering much thought.)
- Since in ALTO all structural elements below the block level have merely optional @ID: Wouldn't it be preferable to start requiring them everywhere?
- We could make them mandatory but this would break forward compatibility of existing documents. While schema versioning in theory should prevent this, in practice people write ad-hoc parsers, so I'm a bit wary of introducing changes like this.
Agreed.
(Or, conversely, why strictly require group elements to have an @ID of their own?)
- I never understood why PAGE-XML decided to need @index for ordered groups, and cast the ordered/unordered distinction into a 2x3 matrix of un/ordered.../indexed types. Why not use the element ordering of the refs itself for ordered groups, and represent the difference between ordered and unordered by a simple boolean @ordered?
- That's a personal preference. The indices are admittedly only in there to not deviate too much from Page.
The deviation would merely be syntactical though. (And the syntactic candy here does weigh heavy.) The actual semantic deviation is regarding sub-region refs (but see above).
BTW, PRImA's own implementation so far does not even respect the indices (but uses implicit ordering solely).
- Another issue I have is the relationship of this new RO mechanism to the existing @IDNEXT mechanism, which are now redundant to some degree: What if they both are used, and what if they clash?
- The easiest way would be to disallow one when the other is present. I'm not proficient enough with XSD to know how one would encode this.
It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…
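A documentation-only rule like this would have to be applied by each consumer. The following is a minimal sketch of how a reader might do that, not part of the proposal: the snippet is namespace-free and hypothetical, and the element names OrderedGroup and ElementRef are assumptions (the thread only names the types and the @REF attribute).

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free snippet. OrderedGroup and ElementRef are
# assumed element names; only IDNEXT and ReadingOrder come from the thread.
SAMPLE = """
<alto>
  <ReadingOrder>
    <OrderedGroup ID="ro1">
      <ElementRef ID="r1" REF="b2"/>
      <ElementRef ID="r2" REF="b1"/>
    </OrderedGroup>
  </ReadingOrder>
  <Layout>
    <TextBlock ID="b1" IDNEXT="b2"/>
    <TextBlock ID="b2"/>
  </Layout>
</alto>
"""


def block_order(root):
    """Return block IDs in reading order, preferring ReadingOrder over IDNEXT."""
    reading_order = root.find("ReadingOrder")
    if reading_order is not None:
        # An explicit reading order is present, so every IDNEXT is ignored.
        return [ref.get("REF") for ref in reading_order.iter("ElementRef")]

    # Fallback: follow IDNEXT chains, starting at blocks nothing points to.
    blocks = {b.get("ID"): b for b in root.iter("TextBlock")}
    targets = {b.get("IDNEXT") for b in blocks.values() if b.get("IDNEXT")}
    order, seen = [], set()
    for start in blocks:
        if start in targets:
            continue
        current = start
        while current in blocks and current not in seen:
            seen.add(current)
            order.append(current)
            current = blocks[current].get("IDNEXT")
    return order
```

With the sample above, the clashing IDNEXT on b1 has no effect as long as ReadingOrder exists. The fallback is deliberately simple and returns nothing for a pure IDNEXT cycle.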
- Regarding @BASEDIRECTION, am I right in assuming it would practically only make sense to have block and line level use orthogonal values, i.e. rtl/ltr in a ttb/btt or ttb/btt in a rtl/ltr? How is this enforced by the schema?
- It should be allowed in any combination, which hopefully makes sense given the BiDi comment above. In any case, I'm loath to prohibit redundant encodings. They are often easier to serialize/deserialize than more compact encodings, while not allowing these doesn't offer any benefits.
Yes. (The question followed from my misunderstanding. I'm not worried about the cost of redundancy here. And functionally, in a DOM you can always fully expand the inheritance.)
- Also, since these are absolute notions, what is the relationship to @ROTATION (which could apply to each block differently)? Should we read this as applying before or after deskewing? What is your point of reference for absolute terms like top and bottom, left and right when you have non-orthogonal @ROTATION: does the interpretation of "left" snap from one side to the other as the angle crosses 45°?
- Top to bottom, bottom to top, left to right, and right to left are line-relative and abstract notions and not absolute with regard to page orientation. Rotation is mostly independent of that. Every single line of this manuscript page would be rtl (or ltr if the text transcription was produced in an environment with the BiDi algorithm base direction set to ltr) but the @ROTATION could be anything.
  @ROTATION only comes into play when deciding how to extract the line image for visual display (ttb/btt lines should be rectified/rotated to be vertical, ltr/rtl lines to be horizontal) but it isn't well enough specified to be useful for that as it isn't clear relative to which axis the rotational angle is. In any case, @BASELINE is much more powerful as it allows rectifying arbitrarily curved and rotated lines; at least for manuscripts where line angle and curvature tend to change inside a text block, @ROTATION is woefully inadequate.
Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)
Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): the axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).
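To make the center-of-block axis concrete, here is a small sketch that rotates a point about (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT). The function name is hypothetical, and the sign convention is an assumption, not something settled in this thread: with y growing downward, as in image coordinates, positive angles here turn clockwise on screen.

```python
import math


def rotate_about_block_center(x, y, hpos, vpos, width, height, angle_deg):
    """Rotate point (x, y) about the center of a block.

    Assumption: image coordinates with y growing downward, so a positive
    angle turns the point clockwise on screen. Flip the sign of angle_deg
    if the opposite convention applies.
    """
    cx = hpos + 0.5 * width   # rotation axis, x coordinate
    cy = vpos + 0.5 * height  # rotation axis, y coordinate
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )
```

For a block at HPOS=0, VPOS=0 with WIDTH=20 and HEIGHT=10, the center is (10, 5), and the midpoint of the right edge (20, 5) lands at roughly (10, 15) after a 90° turn under this convention.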
Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.
But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)
Yes, you could decompose it like this, but you're losing some of the semantics of TextLine or lower level elements.
It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…
Yeah, I'm not sure how to do this well. AFAIK there's no good document introducing the standard, and the schema comments are often lacking. We should probably get around to writing down the semantics of most constructs a bit more explicitly.
Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)
Almost certainly. It doesn't really make sense otherwise.
Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): the axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).
I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?
Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.
Of course. You actually need both to rotate a line correctly into the plane as the polygonal boundary can be deceiving when curvature and messy or differently sized letters come in combination.
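As a rough illustration of why a polyline baseline says more than a single rotation value, the sketch below computes the angle of each baseline segment. It assumes the baseline is already available as a list of (x, y) points; how @BASELINE is serialized and parsed is left out, and the function name is hypothetical.

```python
import math


def segment_angles(baseline):
    """Angle in degrees of each baseline segment, measured from the +x axis.

    `baseline` is a list of (x, y) points. With y growing downward, a
    positive angle means the segment descends on the page.
    """
    return [
        math.degrees(math.atan2(y1 - y0, x1 - x0))
        for (x0, y0), (x1, y1) in zip(baseline, baseline[1:])
    ]
```

A curved line such as [(0, 0), (10, 0), (20, 10)] gives one angle per segment (about 0° and 45°), which a single per-line or per-block angle cannot express.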
@cneud @artunit Can we get this discussed at the next board meeting? I missed the one in September but can definitely prepare something for the next one.
I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?
I would argue for the latter, because 90/270/left/right is different from vertical writing. So @ROTATION and @orientation are catch-alls for skew and 90° multiples, while the other attributes are truly ordering relations. (The fact that vertical script is trained horizontally and thus glyphs are not upwards when they enter the OCR engine should not be relevant here.)
I have been terribly disconnected lately but am happy to try to align this discussion with a Board meeting. @mittagessen, @bertsky, @cneud - would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting date/time for you? @cipriandinu - would that work for you as well? We could consider an earlier date if it works for @cipriandinu, I just can't guarantee a network connection until that point.
I have been terribly disconnected lately but am happy to try to align this discussion with a Board meeting. @mittagessen, @bertsky, @cneud - would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting date/time for you?
Unfortunately I'm teaching during those exact hours. Otherwise my November is still free though, so any other date (or even on 18/11 in the afternoon) would work.
Hi Art,
The 18th of November is OK for me, or earlier.
Best, Cip
Apologies, I will follow up by email instead of overloading the issue thread.
Sry, did not see this earlier: I wonder if OrderedGroupType and UnorderedGroupType should also get a @REF (as they do in PAGE-XML). Without this, you'd need to add one additional ElementRefType into each group, but you would need to construct the order hierarchy differently than in PAGE-XML (i.e. graphs would have to be transformed when converting).
Also, IMHO the formulation for @BASEDIRECTION ("Indicates the inline base direction") and InlineDirType ("Describes the base direction of text inside a line or of all lines inside a text block.") can still be improved.
I've created some examples on how to use these extensions: alto_ro_examples.
This pull request bundles multiple backward compatible changes to the schema that resolve issues related to line orientation, direction, and reading order. While I would usually split them into separate PRs, they've been discussed jointly in the past (see https://github.com/altoxml/schema/issues/12#issuecomment-113184844) and the addressed deficiencies are somewhat complementary.
Principal inline text direction
The first part of the proposal takes up #12 and #73. It adds an attribute BASEDIRECTION on the *Block and TextLine elements which indicates the base text direction of the lines/text contained therein (ltr|rtl|ttb|btt). This is helpful not only for rendering purposes of many East Asian scripts that can be written both vertically and horizontally but also to correctly set the base text direction of the BiDi algorithm during processing.
Example of the use of this new attribute:
Different settings on lower levels of the hierarchy override those inherited from higher levels.
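The override rule can be sketched from a consumer's point of view as follows. This is an illustration under assumptions, not the schema's definition: the snippet is a hypothetical, namespace-free fragment, only TextBlock is handled (not the other *Block elements), and lines with no value at either level are reported as None because no default is stated here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment for illustration only.
SAMPLE = """
<PrintSpace>
  <TextBlock ID="b1" BASEDIRECTION="rtl">
    <TextLine ID="l1"/>
    <TextLine ID="l2" BASEDIRECTION="ltr"/>
  </TextBlock>
  <TextBlock ID="b2">
    <TextLine ID="l3"/>
  </TextBlock>
</PrintSpace>
"""


def effective_directions(root):
    """Map each TextLine ID to its effective BASEDIRECTION.

    A value on the line overrides the value inherited from its block.
    None means neither level specifies a direction.
    """
    result = {}
    for block in root.iter("TextBlock"):
        inherited = block.get("BASEDIRECTION")
        for line in block.iter("TextLine"):
            result[line.get("ID")] = line.get("BASEDIRECTION", inherited)
    return result
```

Here l1 inherits rtl from its block, l2 overrides it with ltr, and l3 stays unspecified.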
Reading Order
This part is a fairly faithful adaptation of the example in #18, which in turn derives from PageXML. Some changes are made to allow the encoding of more complex historical documents and the serialization of multiple reading orders. The two principal changes are:
- Referencing elements below TextBlock such as TextLine, String, and Glyph is possible.
In addition, elements in the reading order can be tagged with TAGREFS to indicate the roles of particular elements in a reading order (such as an addition, correction or a particle that is the continuation of text on another line).
A single reading order with roles:
Multiple reading orders can be encoded through the nesting of unordered and ordered groups:
While this can potentially result in ambiguity in the absence of a taxonomy of well-defined roles of elements in the reading order, I'd like to avoid specifying this until a later point in time, as it requires substantial input from users of the standard, especially those with more esoteric material. As can be seen in the last examples, groups can also be arbitrarily nested.
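One possible way for a consumer to read such nested groups is sketched below. Everything structural in it is an assumption made for illustration: the element names OrderedGroup, UnorderedGroup and ElementRef, the namespace-free snippet, and above all the interpretation that each OrderedGroup directly inside a top-level UnorderedGroup is one alternative reading order.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: a main text with a marginal insertion between two
# lines, plus a second reading order that skips the insertion.
SAMPLE = """
<ReadingOrder>
  <UnorderedGroup ID="g0">
    <OrderedGroup ID="main">
      <ElementRef ID="r1" REF="line1"/>
      <OrderedGroup ID="insertion">
        <ElementRef ID="r2" REF="margin1"/>
      </OrderedGroup>
      <ElementRef ID="r3" REF="line2"/>
    </OrderedGroup>
    <OrderedGroup ID="plain">
      <ElementRef ID="r4" REF="line1"/>
      <ElementRef ID="r5" REF="line2"/>
    </OrderedGroup>
  </UnorderedGroup>
</ReadingOrder>
"""


def flatten(group):
    """Depth-first list of referenced IDs inside a (possibly nested) group."""
    refs = []
    for child in group:
        if child.tag == "ElementRef":
            refs.append(child.get("REF"))
        else:
            # Nested group: splice its references in at this position.
            refs.extend(flatten(child))
    return refs


def reading_orders(reading_order):
    """Return {group ID: flat list of referenced IDs} for each reading order."""
    top = reading_order[0]
    if top.tag == "OrderedGroup":
        return {top.get("ID"): flatten(top)}
    # Assumed reading: each ordered child of the unordered group is one order.
    return {
        group.get("ID"): flatten(group)
        for group in top
        if group.tag == "OrderedGroup"
    }
```

For the sample, this yields the order line1, margin1, line2 for "main" and line1, line2 for "plain". Flattening a nested UnorderedGroup in document order is a simplification, since such a group by definition carries no order.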
EDIT: Removed explicit indices as attributes and the corresponding indexed types.