altoxml / schema

ALTO XML schema - latest and all former versions

Non Linear Hyphens #41

Open jpmoreux opened 8 years ago

jpmoreux commented 8 years ago

Describing a hyphenation that runs across two pages, or between the main text flow and a footnote block, is non-deterministic.

Example: on the left page, the last footnote ends with a hyphenated word ("Victor-"). On the right page, the main text flow contains a hyphenated word ("Fon-"), followed by the second part of the page 194 hyphenation ("Hugo").

In this example, the ALTO markup could suggest that the String "Fon-" is the first part of a hyphenated word (HypPart1) and the String "Hugo" is its second part (HypPart2). A validation tool that checks hyphenation consistency would then fail to do its job.
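To make the failure concrete, here is a minimal Python sketch of a naive check that pairs each HypPart1 with the next HypPart2 in document order. The function name `naive_pairs` and the simplified `SAMPLE` fragment are hypothetical: the sample keeps only the relevant attributes and omits the XML namespace that real ALTO files declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: only the attributes needed here are kept,
# and the ALTO namespace is omitted for brevity.
SAMPLE = """
<Page>
  <TextBlock>
    <TextLine>
      <String CONTENT="Fon-" SUBS_CONTENT="Fontanes" SUBS_TYPE="HypPart1"/>
      <HYP CONTENT="-"/>
    </TextLine>
  </TextBlock>
  <TextBlock>
    <TextLine>
      <String CONTENT="Hugo" SUBS_CONTENT="Victor-Hugo" SUBS_TYPE="HypPart2"/>
    </TextLine>
  </TextBlock>
</Page>
"""

def naive_pairs(root):
    """Pair each HypPart1 with the next HypPart2 found in document order."""
    pairs, pending = [], None
    for s in root.iter("String"):
        kind = s.get("SUBS_TYPE")
        if kind == "HypPart1":
            pending = s
        elif kind == "HypPart2" and pending is not None:
            pairs.append((pending.get("CONTENT"), s.get("CONTENT")))
            pending = None
    return pairs

# The two unrelated fragments end up paired with each other.
print(naive_pairs(ET.fromstring(SAMPLE)))
```

A tool built this way would treat "Fon-" and "Hugo" as two halves of one word, which is the confusion described above.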

These ALTO files were produced during an EPUB+ALTO digitization program. The EPUB format needs footnotes to be identified; consequently, the hyphens exported to the ALTO files are logically correct but "unclear" in the ALTO "context".

...
<String ID="PAG_00000213_ST000193" CONTENT="Fon-" HEIGHT="44" HPOS="1335" STYLEREFS="TXT_14" SUBS_CONTENT="Fontanes" SUBS_TYPE="HypPart1" VPOS="2214" WC="1" WIDTH="100"/>
<HYP CONTENT="-" HPOS="1435" VPOS="2214" WIDTH="26"/>
</TextLine>
</TextBlock>
<TextBlock ID="PAG_00000213_TB000010" HEIGHT="156" HPOS="224" STYLEREFS="TXT_77" VPOS="2394" WIDTH="1236" language="FR">
<TextLine ID="PAG_00000213_TL000025" BASELINE="2431" HEIGHT="48" HPOS="224" VPOS="2394" WIDTH="1235">
<String ID="PAG_00000213_ST000194" CONTENT="Hugo" HEIGHT="45" HPOS="224" STYLEREFS="TXT_7" SUBS_CONTENT="Victor-Hugo" SUBS_TYPE="HypPart2" VPOS="2394" WC="0.983" WIDTH="110"/>

[Images: pages 212 and 213]

artunit commented 7 years ago

The SUBS_CONTENT attribute does distinguish between "Fontanes" and "Victor-Hugo". Would checking SUBS_CONTENT be an option for a validation tool?
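One way such a check might look, as a rough Python sketch: group hyphen parts by their SUBS_CONTENT value and report any word for which only one part is present in the file. The function name `unmatched_hyphen_parts` and the simplified `SAMPLE` fragment are hypothetical; the sample omits the XML namespace that real ALTO files declare, and the sketch does not handle the same word being hyphenated more than once in a file.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: only the attributes needed here are kept,
# and the ALTO namespace is omitted for brevity.
SAMPLE = """
<Page>
  <TextBlock>
    <TextLine>
      <String CONTENT="Fon-" SUBS_CONTENT="Fontanes" SUBS_TYPE="HypPart1"/>
      <HYP CONTENT="-"/>
    </TextLine>
  </TextBlock>
  <TextBlock>
    <TextLine>
      <String CONTENT="Hugo" SUBS_CONTENT="Victor-Hugo" SUBS_TYPE="HypPart2"/>
    </TextLine>
  </TextBlock>
</Page>
"""

def unmatched_hyphen_parts(root):
    """Map each SUBS_CONTENT word to its parts when only one part is present."""
    parts = defaultdict(set)
    for s in root.iter("String"):
        kind = s.get("SUBS_TYPE")
        if kind in ("HypPart1", "HypPart2"):
            parts[s.get("SUBS_CONTENT")].add(kind)
    # A word with both HypPart1 and HypPart2 is complete within this file.
    return {word: sorted(kinds) for word, kinds in parts.items() if len(kinds) < 2}

print(unmatched_hyphen_parts(ET.fromstring(SAMPLE)))
```

On this fragment both words are reported as incomplete, which tells the tool that "Fon-" and "Hugo" belong to different hyphenations whose other halves lie elsewhere (another page or another block), instead of pairing them with each other.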