mittagessen closed this issue 2 years ago
Thank you Ben for chiming in and raising the issue here! I saw the tweets but you beat me to it ;)
There was some discussion on making this more explicit previously, the outcomes of which are captured here https://github.com/altoxml/schema/issues/12#issuecomment-113184844, but eventually no changes to the schema were made. Maybe now is a good time to revisit this.
Yes, I'm aware of the previous discussions on reading order, but those require a larger overhaul of the object ordering notation, as quite a few documents would need multiple reading orders and other 'advanced' features.
This is mostly about putting an explanatory note in some supplementary material stating that textual elements are not some amorphous cloud but that their sequence represents a valid text flow, i.e. clarifying how all but one piece of software currently serializes into ALTO. No actual changes to the schema are needed.
Adding a link to issue 69 (Confidence value for Layout detection of elements) here.
ACCEPT
Published in v4.3
There are currently no mentions of reading order anywhere in the standard and most people treat the sequence of elements as the order these elements should be read, e.g. the n-th <String> in a <TextLine> is the n-th word a human reader would read in that line.
Apparently this isn't evident to everyone out there. These tweets document that Transkribus's ALTO output sorts <String> elements from left to right which causes an inversion for RTL text. We should probably clarify that <TextLine>/<String>/<Glyph> are to be ordered in a way that corresponds to the text flow.
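To make the difference concrete, here is a minimal sketch in Python of reading one line in element order versus re-sorting it by HPOS. The sample document, the placeholder words ("first", "second", "third"), and the coordinates are made up for illustration; the placeholders stand in for the words of an RTL line, so the first word in reading order sits furthest to the right. The helper names are hypothetical and not part of any ALTO tooling.

```python
import xml.etree.ElementTree as ET

# ALTO v4 namespace; the sample below is an illustrative fragment, not real output.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# One RTL line: element order is reading order, so HPOS decreases along the line.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="first" HPOS="300" VPOS="10" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="second" HPOS="200" VPOS="10" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="third" HPOS="100" VPOS="10" WIDTH="80" HEIGHT="20"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def words_in_element_order(line):
    """Return the words of a TextLine in the order the String elements appear."""
    return [s.get("CONTENT") for s in line.findall("alto:String", ALTO_NS)]


def words_sorted_by_hpos(line):
    """Return the words sorted left to right by HPOS (the problematic approach)."""
    strings = line.findall("alto:String", ALTO_NS)
    ordered = sorted(strings, key=lambda s: float(s.get("HPOS")))
    return [s.get("CONTENT") for s in ordered]


root = ET.fromstring(SAMPLE)
line = root.find(".//alto:TextLine", ALTO_NS)

element_order = words_in_element_order(line)  # follows the text flow
hpos_order = words_sorted_by_hpos(line)       # inverted for this RTL line

print(element_order)
print(hpos_order)
```

For an LTR line both functions return the same sequence, which is probably why the assumption goes unnoticed until RTL material is involved.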