altoxml / schema

ALTO XML schema - latest and all former versions

Glyphs should allow CONTENT with length above 1 for cases where no precombined character exists #85

Open urieli opened 7 months ago

urieli commented 7 months ago

The GlyphType documentation states:

Accordingly the value for the glyph element will be defined as follows:
Pre-composed representation = base + combining character(s) (decomposed representation)
See http://www.fileformat.info/info/unicode/char/0101/index.htm
"U+0101" = (U+0061) + (U+0304)
"combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ.

This is accompanied by the restriction length=1 for the CONTENT attribute:

<xsd:attribute name="CONTENT" use="required">
    <xsd:annotation>
        <xsd:documentation>
            CONTENT contains the precomposed representation (combining character) of the character from the parent String element.
            The sequence position of the Glyph element matches the position of the character in the String.
        </xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType>
        <xsd:restriction base="xsd:string">
            <xsd:length fixed="true" value="1"/>
            <xsd:whiteSpace value="preserve"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:attribute>

Unfortunately, in some alphabets, a precomposed representation does not exist.

For example, in the Hebrew alphabet, many letters can carry three diacritics at once.

Even if we ignore cantillation marks, which are limited to biblical text, only a very small fraction of the possible combinations exists as precombined characters.

Thus, there is no precombined character for "בָּ" or even the more common "בָ".
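
To make the length problem concrete, here is a small Python sketch (not part of the schema; the names are only illustrative). It assumes that `xsd:length` on `xsd:string` counts Unicode characters, i.e. code points, which is also what Python's `len()` counts. NFC normalization composes a + U+0304 into the single precomposed U+0101, but it leaves the Hebrew sequences at two and three code points, so they can never satisfy length=1.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Return the number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# Latin example from the GlyphType documentation: a + COMBINING MACRON.
a_macron = "a\u0304"
# Hebrew BET + QAMATS (no precomposed character exists).
bet_qamats = "\u05D1\u05B8"
# Hebrew BET + QAMATS + DAGESH.
bet_qamats_dagesh = "\u05D1\u05B8\u05BC"

for label, sample in [
    ("a + macron", a_macron),
    ("bet + qamats", bet_qamats),
    ("bet + qamats + dagesh", bet_qamats_dagesh),
]:
    # Raw length versus length after composing as far as NFC allows.
    print(f"{label}: raw={len(sample)} nfc={nfc_length(sample)}")
```

Only the first sample ends up at length 1 after normalization; the two Hebrew samples keep their raw lengths.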

Therefore, to be able to represent Hebrew glyphs properly, we should change the specification to something like:

<xsd:attribute name="CONTENT" use="required">
    <xsd:annotation>
        <xsd:documentation>
            CONTENT contains the representation of the character from the parent String element.
            Precombined characters are recommended, but it is acceptable to have one base character and zero-to-many combining diacritics.
            The sequence position of the Glyph element matches the position of the character in the String.
        </xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType>
        <xsd:restriction base="xsd:string">
            <xsd:maxLength value="4" />
            <xsd:whiteSpace value="preserve"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:attribute>

We should also remove the passage quoted at the top of this issue from the GlyphType documentation.

I'm not sure whether other alphabets would require more than 4 characters; maybe the maxLength facet could be removed entirely.
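
As a rough illustration of the proposed rule "one base character and zero-to-many combining diacritics", here is a Python sketch of a check that does not depend on a fixed maximum length. The function name `is_base_plus_marks` is hypothetical, and treating every code point in a Unicode Mark category (Mn, Mc, Me) as a "combining diacritic" is my assumption, not something the ALTO documentation defines.

```python
import unicodedata


def is_base_plus_marks(content: str) -> bool:
    """Check for exactly one non-mark character followed by zero or more marks.

    Assumption: "combining diacritic" means any code point whose Unicode
    general category starts with "M" (Mn, Mc or Me).
    """
    if not content:
        return False
    base, marks = content[0], content[1:]
    # The first code point must be a base character, not a combining mark.
    if unicodedata.category(base).startswith("M"):
        return False
    # Everything after the base must be a combining mark.
    return all(unicodedata.category(ch).startswith("M") for ch in marks)


# SHIN + SHIN DOT + DAGESH + QAMATS: four code points, one glyph.
shin_pointed = "\u05E9\u05C1\u05BC\u05B8"
# The same glyph with a cantillation mark (ETNAHTA) added: five code points.
shin_with_accent = shin_pointed + "\u0591"

print(is_base_plus_marks(shin_pointed), len(shin_pointed))
print(is_base_plus_marks(shin_with_accent), len(shin_with_accent))
print(is_base_plus_marks("ab"))
```

The second sample suggests that a limit of 4 would already be too small for pointed biblical Hebrew with cantillation, which is an argument for dropping the maximum. If a structural restriction is still wanted in the schema itself, an `xsd:pattern` along the lines of `\P{M}\p{M}*` might express the same idea, though I have not tested that against the ALTO schema.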

cipriandinu commented 1 month ago

Thank you for raising this topic. This change could be a good candidate for 5.0 as well. Perhaps we can find other use cases (other languages) to support it, along with a usage sample.

cipriandinu commented 1 month ago

A similar topic is already open: #44