altoxml / schema

ALTO XML schema - latest and all former versions

Make it possible to distinguish hard and soft hyphens #86

Open urieli opened 8 months ago

urieli commented 8 months ago

Currently there is no way to distinguish hard hyphens from soft hyphens in HYP elements.

Example of a hard hyphen:

I separated the words by a non-
breaking space.

Example of a soft hyphen:

I separated the words by a non-break-
ing space.

However, an OCR system can often distinguish the two (e.g. by checking a lexicon of known words). It should be able to pass this information to downstream systems in the ALTO file, because the distinction can affect OCR-to-text conversion and OCR layer indexing strategies.

I suggest changing the HYP element to include a new HARD_HYPHEN attribute, as follows:

<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>A hyphenation char. Can appear only at the end of a line.</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"/>
    <xsd:attribute name="HARD_HYPHEN" type="xsd:boolean" use="optional">
      <xsd:annotation>
        <xsd:documentation>True if this is a hard hyphen (it would appear in the word regardless of where the line breaks), false if this is a soft hyphen (it appears only because the word is split at the end of a line).</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
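
For illustration, the two examples above might be encoded roughly as follows if the attribute were adopted. This is only a sketch under stated assumptions: HARD_HYPHEN is the attribute proposed in this issue and is not part of any released ALTO schema, the geometry attributes and SUBS_TYPE/SUBS_CONTENT are omitted for brevity, and the fragment is not a complete, valid ALTO file.

```xml
<TextBlock>
  <!-- Hard hyphen: "non-" / "breaking" should be rejoined as "non-breaking" -->
  <TextLine>
    <String CONTENT="non"/>
    <!-- HARD_HYPHEN is the proposed (hypothetical) attribute -->
    <HYP CONTENT="-" HARD_HYPHEN="true"/>
  </TextLine>
  <TextLine>
    <String CONTENT="breaking"/>
    <SP/>
    <String CONTENT="space."/>
  </TextLine>

  <!-- Soft hyphen: "non-break-" / "ing" should also be rejoined as "non-breaking" -->
  <TextLine>
    <String CONTENT="non-break"/>
    <HYP CONTENT="-" HARD_HYPHEN="false"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ing"/>
    <SP/>
    <String CONTENT="space."/>
  </TextLine>
</TextBlock>
```

With this encoding, a downstream OCR-to-text step could keep the HYP content when HARD_HYPHEN is true and drop it when it is false, so both cases yield "non-breaking". When the attribute is absent, the consumer would have to fall back to its current behaviour.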

cipriandinu commented 4 months ago

Thank you for your proposal. We will discuss it and take it into account for the next release (5.0).

cipriandinu commented 3 months ago

Maybe this should be discussed in a larger context; see #43.