Currently there is no way to distinguish hard from soft hyphens in HYP elements.
Example of a hard hyphen:
I separated the words by a non-
breaking space.
Example of a soft hyphen:
I separated the words by a non-break-
ing space.
However, an OCR system can often distinguish the two (e.g. by checking a lexicon of known words). It should be able to pass this information to downstream systems in the ALTO file, because it can affect OCR-to-text conversion and OCR-layer indexing strategies.
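To make the lexicon check concrete, here is a minimal Python sketch of how an OCR engine might classify a line-end hyphen. The function name `classify_hyphen` and the tiny `LEXICON` are hypothetical and only illustrate the idea; a real engine would likely use a larger dictionary and more evidence.

```python
def classify_hyphen(before, after, lexicon):
    """Guess whether a line-end hyphen is hard (True) or soft (False).

    `before` is the text preceding the hyphen at the end of the line,
    `after` is the text starting the next line. Returns None when the
    lexicon does not settle the question.
    """
    joined = (before + after).lower()            # word if the hyphen is soft
    hyphenated = (before + "-" + after).lower()  # word if the hyphen is hard
    if hyphenated in lexicon and joined not in lexicon:
        return True
    if joined in lexicon and hyphenated not in lexicon:
        return False
    return None


# Hypothetical lexicon, just large enough for the two examples above.
LEXICON = {"non-breaking", "breaking", "space"}

hard_example = classify_hyphen("non", "breaking", LEXICON)
soft_example = classify_hyphen("non-break", "ing", LEXICON)
unknown_example = classify_hyphen("foo", "bar", LEXICON)
```

With this lexicon, the first call classifies the hyphen as hard, the second as soft, and the third is left undecided.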
I suggest changing the HYP element to include a new HARD_HYPHEN attribute, as follows:
<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>A hyphenation char. Can appear only at the end of a line.</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"/>
    <xsd:attribute name="HARD_HYPHEN" type="xsd:boolean" use="optional">
      <xsd:annotation>
        <xsd:documentation>True if this is a hard-hyphen (would appear in the word regardless of print location), false if this is a soft hyphen (only appears in the word if it is split at the end of a line).</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
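To illustrate how a downstream OCR-to-text step could use the proposed attribute, here is a rough Python sketch. The ALTO fragment is hypothetical and simplified (no namespace, no layout attributes), and treating a HYP without HARD_HYPHEN as soft is an assumption of this sketch, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO fragment using the proposed HARD_HYPHEN attribute.
ALTO_FRAGMENT = """
<TextBlock>
  <TextLine>
    <String CONTENT="a"/><SP/><String CONTENT="non"/><HYP CONTENT="-" HARD_HYPHEN="true"/>
  </TextLine>
  <TextLine>
    <String CONTENT="breaking"/><SP/><String CONTENT="space,"/><SP/><String CONTENT="a"/><SP/><String CONTENT="non-break"/><HYP CONTENT="-" HARD_HYPHEN="false"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ing"/><SP/><String CONTENT="space."/>
  </TextLine>
</TextBlock>
"""


def is_hard(hyp):
    # xsd:boolean allows "true"/"1" and "false"/"0".
    # Assumption: a missing attribute is treated as a soft hyphen.
    return hyp.get("HARD_HYPHEN", "false") in ("true", "1")


def block_to_text(block):
    """Rebuild running text from a TextBlock, resolving line-end hyphens."""
    parts = []
    for line in block.iter("TextLine"):
        ends_with_hyp = False
        for el in line:
            if el.tag == "String":
                parts.append(el.get("CONTENT", ""))
                ends_with_hyp = False
            elif el.tag == "SP":
                parts.append(" ")
                ends_with_hyp = False
            elif el.tag == "HYP":
                # Keep the hyphen only if it belongs to the word itself.
                if is_hard(el):
                    parts.append(el.get("CONTENT", "-"))
                ends_with_hyp = True
        # A line ending in HYP continues the word; otherwise separate lines.
        if not ends_with_hyp:
            parts.append(" ")
    return "".join(parts).strip()


text = block_to_text(ET.fromstring(ALTO_FRAGMENT))
print(text)
```

Both line breaks resolve to the same word, "non-breaking": the hard hyphen is kept and the soft one is dropped. This is the distinction a text extractor or indexer cannot make from the current HYP element alone.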