Open splet opened 7 years ago
This was discussed in the technical sessions, and I think it is also explained by Jean-Philip's statement that the main glyph should be a single sign, limited to length 1. The aim is to prevent misuse or misinterpretation from binding multiple characters to one glyph and then having all kinds of possible combinations for the alternatives. See https://github.com/altoxml/schema/issues/26#issuecomment-200405659
In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested.
The change proposed by @Jo-CCS and adopted into 4.0-4.2 includes this detail of restricting "character" length, which seems overly restrictive to me, not just with respect to OCR results but on principle: in some languages and scripts, not all relevant characters can be represented by a single Unicode codepoint (not to be confused with Glyph or grapheme cluster), yet a single codepoint is what the schema enforces.
Scripts like Arabic, Hebrew, Devanagari and Bengali rely heavily on combining mark sequences, and even for European languages (esp. in historic texts) a precomposed codepoint is not always available. For example, the German umlauts äöü can be decomposed not only as äöü (with combining trema) but also as aͤoͤuͤ (with combining e). The same holds for other rare diacritics. One could argue the same for fractions, where only a few like ¾ ⅔ are available precomposed; the others need to be decomposed, as in 3⁄4 2⁄3.
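To make the codepoint counts concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `nfc_length` is made up for illustration; it counts codepoints after NFC normalization, which is only an approximation of what a schema length restriction would see.

```python
import unicodedata

def nfc_length(text: str) -> int:
    """Number of Unicode codepoints after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))

precomposed_umlaut = "\u00e4"   # ä as one codepoint
trema_umlaut = "a\u0308"        # a + COMBINING DIAERESIS
small_e_umlaut = "a\u0364"      # a + COMBINING LATIN SMALL LETTER E
decomposed_fraction = "3\u20444"  # 3 + FRACTION SLASH + 4

# The trema form composes to a single codepoint under NFC ...
print(nfc_length(trema_umlaut))         # 1
# ... but there is no precomposed form for a + combining e,
print(nfc_length(small_e_umlaut))       # 2
# and a fraction written with FRACTION SLASH stays at three codepoints.
print(nfc_length(decomposed_fraction))  # 3
```

So normalization helps for the common umlauts, but not for the historic `aͤ` form or for most fractions.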
Please re-open.
I agree, this should be re-opened. Some glyphs we have in historic prints, like aͤ (LATIN SMALL LETTER A + COMBINING SMALL LETTER E) cannot be represented in a single Unicode code point and the cited XML Schema restriction does not allow us to save them in a valid ALTO document.
Thanks for the comments, this issue is reopened.
Separated from https://github.com/altoxml/schema/issues/26#issuecomment-256652798

Glyph variants: The main glyphs are restricted to length 1 but variants to length 3. This could be a bit inconvenient when dealing with OCR results. Say FineReader returns 5 options, some with length 1 and some longer. What happens if the first one is not of length 1? Does the ALTO exporter tool then check whether there is one with length 1 among the other options and change the order? And why three? For Latin that would probably cover most cases, but for other scripts there might be longer ones.
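The reordering question can be illustrated with a small Python sketch. This is not how any actual ALTO exporter or FineReader behaves; `pick_main_and_variants` is a hypothetical function, and the limits of 1 and 3 are simply the ones quoted above, measured here in codepoints.

```python
def pick_main_and_variants(candidates, main_max=1, variant_max=3):
    """Hypothetical exporter step: fit ranked OCR candidates into the length limits.

    The first candidate short enough becomes the main glyph; the remaining
    candidates become variants if they fit, and are dropped otherwise.
    """
    main = None
    variants = []
    dropped = []
    for text in candidates:
        if not text:
            continue  # ignore empty candidates
        length = len(text)  # codepoints, not grapheme clusters
        if main is None and length <= main_max:
            main = text
        elif length <= variant_max:
            variants.append(text)
        else:
            dropped.append(text)
    return main, variants, dropped

# Ranked OCR output (best first); the top candidate has two codepoints.
ranked = ["a\u0364", "\u00e4", "a", "a\u0308", "abcd"]
main, variants, dropped = pick_main_and_variants(ranked)
print(main)      # ä  (the second-ranked candidate is promoted)
print(variants)  # the top-ranked aͤ is demoted to a variant
print(dropped)   # ['abcd']
```

In this sketch the recognizer's best guess loses its rank purely because of its length, which is exactly the inconvenience described above.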