Glyphs (IMPACT) - Githubissues

Jo-CCS commented 10 years ago

Submitter: Impact Submitted: 2013-02

use case

Modern OCR software stores information at glyph level. A glyph is essentially a character or ligature. Each character has its own coordinate information and must be separately addressable as a distinct object. Correction and verification processes can be carried out for individual characters. Post-OCR analysis of the text as well as adaptive OCR algorithms must be able to record information at character level.

In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. The confidence score expresses how confident the OCR software is that a single glyph has been recognized correctly.

implementation

Glyphs are recorded in the <Glyph> element. This element is optional and a child element of <String>. The glyph element may have a <Shape> element (see above). The (recognized) character of the glyph is stored in the CONTENT attribute.

The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. Due to post-processing steps such as correction the values of both attributes may be inconsistent.

Each <Glyph> element may have an optional VALID attribute. This attribute may only have one of the following three values:

•“s” - the glyph is a suspicious character. The OCR software is not confident that it has recognized the glyph correctly.
•“r” - the character has been rejected; the OCR software is confident that this character is not the glyph.
•“c” - the OCR software is confident that it has recognized the glyph correctly.

Each <Glyph> may have one or more <Variant> elements. Each variant represents an option for the glyph that the OCR software could have chosen. The <Variant> element’s VC attribute records a float value between 0 and 1 that expresses the level of confidence for the variant, where 1 is fully confident. This attribute is optional. If it is not available, the default value for the variant is “0”. The semantics of the VC attribute are similar to those of the WC attribute of the <String> element.

example

<TextBlock ID="P4_TB00001">
  <TextLine ID="P4_TL00001">
    <Shape>
      <Rectangle HPOS="230" VPOS="216" WIDTH="987" HEIGHT="31" />
    </Shape>
    <String ID="P4_ST00001" CONTENT="12" WC="0.99" CC="02">
      <Shape>
        <Rectangle HPOS="230" VPOS="223" WIDTH="37" HEIGHT="24"/>
      </Shape>
      <Glyph ID="P4_ST00001_G01" CONTENT="1" VALID="s" HPOS="230" VPOS="223" WIDTH="10" HEIGHT="24">
        <Shape>
          <Polygon />
        </Shape>
        <Variant VC="0.2">l</Variant>
        <Variant VC="0.1">i</Variant>
      </Glyph>
      <Glyph ID="P4_ST00001_G02" CONTENT="2" HPOS="240" VPOS="223" WIDTH="10" HEIGHT="24">
        <Shape>
          <Polygon />
        </Shape>
        <Variant VC="0.5">s</Variant>
        <Variant VC="0.1">8</Variant>
      </Glyph>
    </String>
  </TextLine>
</TextBlock>
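
As a rough illustration of how a consumer could read this structure, here is a minimal Python sketch using only the standard library. It assumes the un-namespaced element names from the example above (real ALTO files carry a namespace), variants stored as element text as in this draft, and a hypothetical helper name `read_glyphs`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above (Shape elements left out for brevity).
SAMPLE = """
<String ID="P4_ST00001" CONTENT="12" WC="0.99" CC="02">
  <Glyph ID="P4_ST00001_G01" CONTENT="1" VALID="s" HPOS="230" VPOS="223" WIDTH="10" HEIGHT="24">
    <Variant VC="0.2">l</Variant>
    <Variant VC="0.1">i</Variant>
  </Glyph>
  <Glyph ID="P4_ST00001_G02" CONTENT="2" HPOS="240" VPOS="223" WIDTH="10" HEIGHT="24">
    <Variant VC="0.5">s</Variant>
    <Variant VC="0.1">8</Variant>
  </Glyph>
</String>
"""


def read_glyphs(string_xml):
    """Return one dict per Glyph: its content, VALID flag and (text, VC) variants."""
    string_el = ET.fromstring(string_xml)
    glyphs = []
    for glyph in string_el.findall("Glyph"):
        variants = [
            # A missing VC falls back to 0, as the description above states.
            (variant.text or "", float(variant.get("VC", "0")))
            for variant in glyph.findall("Variant")
        ]
        glyphs.append({
            "content": glyph.get("CONTENT"),
            "valid": glyph.get("VALID"),  # None when the attribute is absent
            "variants": variants,
        })
    return glyphs


glyphs = read_glyphs(SAMPLE)
for g in glyphs:
    print(g["content"], g["valid"], g["variants"])
```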

Proposed change (initial draft):

<xsd:complexType name="StringType" mixed="false">
  <xsd:annotation>
    <xsd:documentation>A sequence of chars. Strings are separated by white spaces or hyphenation chars.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="0">
    <xsd:element name="Shape" type="ShapeType" minOccurs="0"/>
    <xsd:element name="Alternative" minOccurs="0" maxOccurs="unbounded">
    ..............
    <xsd:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence> 
  <xsd:complexType name="GlyphType" mixed="false">
  <xsd:annotation> 
    <xsd:documentation>
      Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. 
      Each character has its own coordinate information and must be separately addressable as a distinct object.
      Correction and verification processes can be carried out for individual characters.
      Post-OCR analysis of the text as well as adaptive OCR algorithms must be able to record information at character level.
      In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants.
      The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph.
      The confidence score expresses how confident the OCR software is that a single glyph has been recognized correctly.

      The glyph elements are in the order of the word. Each character needs to be recorded to build up the whole word sequence.

      The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute.
      Due to post-processing steps such as correction the values of both attributes may be inconsistent. 

    </xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="0">
    <xsd:element name="Shape" type="ShapeType" minOccurs="0"/>
    <xsd:element name="Variant" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>Any alternative for the glyph.</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute name="VC" use="optional">
              <xsd:annotation>
                <xsd:documentation>
                  Each variant represents an option for the glyph that the OCR software could have chosen.
                  The element’s VC attribute records a float value between 0 and 1 that expresses
                  the level of confidence for the variant, where 1 is fully confident.
                  This attribute is optional. If it is not available, the default value for the variant is “0”.

                  The semantics of the VC attribute are similar to those of the WC attribute of the String element.
                </xsd:documentation>
              </xsd:annotation>
              <xsd:simpleType>
                <xsd:restriction base="xsd:float">
                  <xsd:minInclusive value="0"/>
                  <xsd:maxInclusive value="1"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:attribute>
        </xsd:extension>
      </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
  <xsd:attribute name="ID" type="xsd:ID" use="optional"/>
  <xsd:attribute name="CONTENT" use="required">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length fixed="true" value="1"/>
        <xsd:whiteSpace value="preserve"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="VALID">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="s"/>
        <xsd:enumeration value="r"/>
        <xsd:enumeration value="c"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>
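
For tools that cannot run a full XSD validator, the three constraints in this draft (CONTENT of length 1, VALID restricted to s/r/c, VC between 0 and 1) can be approximated in a few lines of Python. This is only a sketch under the assumptions of the draft above: un-namespaced element names and a hypothetical helper name `check_glyph`. It does not replace schema validation.

```python
import xml.etree.ElementTree as ET

ALLOWED_VALID = {"s", "r", "c"}


def check_glyph(glyph):
    """Return a list of human-readable problems for one Glyph element."""
    problems = []

    # CONTENT is required and restricted to exactly one character in the draft.
    content = glyph.get("CONTENT")
    if content is None:
        problems.append("CONTENT is required")
    elif len(content) != 1:
        problems.append(f"CONTENT must have length 1, got {content!r}")

    # VALID is optional, but if present it must be one of the enumerated values.
    valid = glyph.get("VALID")
    if valid is not None and valid not in ALLOWED_VALID:
        problems.append(f"VALID must be one of s, r, c, got {valid!r}")

    # VC is optional on each Variant; if present it must be a float in [0, 1].
    for variant in glyph.findall("Variant"):
        vc = variant.get("VC")
        if vc is None:
            continue
        try:
            value = float(vc)
        except ValueError:
            problems.append(f"VC is not a number: {vc!r}")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"VC out of range 0..1: {vc!r}")
    return problems


good = ET.fromstring('<Glyph CONTENT="m" VALID="s"><Variant VC="0.2">rn</Variant></Glyph>')
bad = ET.fromstring('<Glyph CONTENT="rn" VALID="x"><Variant VC="1.5">m</Variant></Glyph>')
print(check_glyph(good))
print(check_glyph(bad))
```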

jpmoreux commented 8 years ago

From the PAGE format

jpmoreux commented 8 years ago

http://primaresearch.org/publications/ICPR2010_Pletschacher_PAGE

cneud commented 8 years ago

Some more thoughts on this issue:

The term "Glyph" was coined in the IMPACT project, as e.g. ligatures or special characters like symbols etc. can not be appropriately described as individual "characters" in the true sense of the word. Coming from that direction, the use of the term "Glyph" appears more inclusive and correct.
Both FineReader-XML and hOCR already offer character level encoding. It should be possible to transform either of these two formats to ALTO without loss of information.
Certainly, the recording of additional information on the "Glyph" level such as e.g. coordinates, confidence values, formatting etc. will increase the file size of any ALTO file that utilizes this feature. However, from practical experience, the benefit of having such additional information outweighs concerns over additional storage requirements in most use cases. E.g. cultural heritage institutions like libraries, archives etc. typically have policies stating that all information that can be digitally captured from the original is also worth preserving digitally. Finally, the use of the Glyph level should be optional. This way, users can choose for themselves whether the benefits outweigh the performance/storage issues. However, not even providing the user with such a choice can be considered a limitation of current ALTO.
It can occur that what has been recognized as a single glyph is actually two glyphs. A common example is the confusion of 'm' and 'rn'. Accordingly, when variants consisting of two glyphs are also being recorded, there is no information available on their individual pixel outlines. In such cases it may be preferable to encode the latter as if it were a single glyph, e.g.

<Glyph ID="P1_ST00001_G04" CONTENT="m" HPOS="262" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.7">
     <Variant VC="0.2">rn</Variant>
</Glyph>

One common use case for the “Glyph” level is post-processing software for OCR correction (as e.g. http://ocr.cis.uni-muenchen.de/ ). Glyph level information is used here to determine possible “correction candidates”, i.e. words that have an overall lower word confidence but may be suitable replacements given one (e.g. low confidence) character is replaced to derive a higher confidence or dictionary word. The above tool which is used in several libraries in Germany does not currently support ALTO due to this limitation.
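
To make the idea of "correction candidates" concrete, here is a small Python sketch. It is not how the tool linked above works; the function name `correction_candidates`, the toy dictionary and the scoring (product of the per-glyph confidences) are assumptions chosen purely for illustration.

```python
from itertools import product


def correction_candidates(glyphs, dictionary):
    """Combine glyph readings into words and keep those found in the dictionary.

    glyphs: one list per glyph position, each holding (text, confidence) pairs
            for the chosen reading and its variants.
    Returns (word, score) pairs, highest score first.
    """
    candidates = []
    for combo in product(*glyphs):
        word = "".join(text for text, _ in combo)
        if word not in dictionary:
            continue
        # Assumed scoring: multiply the confidences of the chosen readings.
        score = 1.0
        for _, confidence in combo:
            score *= confidence
        candidates.append((word, score))
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# The OCR result "cet" is not a dictionary word; the middle glyph has two variants.
glyphs = [
    [("c", 0.9)],
    [("e", 0.5), ("a", 0.4), ("o", 0.1)],
    [("t", 0.9)],
]
ranked = correction_candidates(glyphs, {"cat", "cot"})
print([word for word, _ in ranked])
```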
It would be preferable to record "Glyph" confidence scores in ALTO similar to the way word confidence score is recorded, i.e. by a float between 0 and 1 with two decimals. While a single digit integer does also work, it is more limiting for post-processing purposes. Obviously, a higher granularity would benefit post-correction algorithms, as e.g. what is "6" may be "0.56" or "0.64", which is a difference significant enough to lead to different results in post-processing. FineReader-XML and hOCR both use a 2- or 3-digit notation as in

<charParams l="317" t="397" r="331" b="426" ... charConfidence="100">t</charParams>
<charParams l="448" t="212" r="504" b="276" ... charConfidence="98" >e</charParams>
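
A conversion from such a charParams element to the proposed Glyph form could look roughly like the Python sketch below. It assumes that charConfidence is on a 0 to 100 scale (as the two values above suggest), that the l/t/r/b values can be used directly as ALTO coordinates, and that the character is the element's text. The function name `char_params_to_glyph` is hypothetical.

```python
import xml.etree.ElementTree as ET


def char_params_to_glyph(char_params):
    """Build a Glyph element from a FineReader-style charParams element."""
    left, top, right, bottom = (int(char_params.get(k)) for k in ("l", "t", "r", "b"))
    # Assumed scale: charConfidence 0..100 maps linearly to a 0..1 float.
    confidence = int(char_params.get("charConfidence")) / 100
    return ET.Element("Glyph", {
        "CONTENT": char_params.text or "",
        "HPOS": str(left),
        "VPOS": str(top),
        "WIDTH": str(right - left),
        "HEIGHT": str(bottom - top),
        "GC": f"{confidence:.2f}",
    })


source = ET.fromstring(
    '<charParams l="448" t="212" r="504" b="276" charConfidence="98">e</charParams>'
)
glyph = char_params_to_glyph(source)
print(ET.tostring(glyph, encoding="unicode"))
```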

mittagessen commented 8 years ago

I've been implementing ALTO support for the OCR pipeline at the Digital Humanities chair at the University of Leipzig and being able to encode results on the lowest recognition granularity would enable us to reduce conversion losses from our "native" TEI facsimile format.

Float confidence values between 0 and 1 would also fit better in the general data model, as the different semantics for the CC field in the String tag seem rather arbitrary (it is also impossible to associate characters and confidences if the "unit" of recognition is unknown, e.g. for multi-codepoint glyphs).

If I understand the schema correctly, significant figures are unspecified as long as values fit in a 32-bit float. We are fine with that, and any sane parser should be able to deal with arbitrary-precision inputs.

Lastly, the correct terminology for anything a human recognizes as a single character on the page according to Unicode TR29 is grapheme cluster. We are using this term throughout our documentation although it somewhat breaks down when an engine produces non-printable output, e.g. combining diacritic + character as two separate outputs instead of a single one.

Jo-CCS commented 8 years ago

Good morning and many thanks for the input.

We are currently discussing this feature and have already been discussing the terminology. We will consider your input in the next discussion round. Feel free to comment on our updated draft, which is expected in a couple of weeks. Regards, Jo

mittagessen commented 8 years ago

Thanks. Terminology-wise glyph seems to be in more widespread use while grapheme cluster has the benefit of being well defined by Unicode. As I said both terms don't encompass all corner cases and I'm not going to start a religious war over which one is better.

On another note hOCR does not allow usable glyph level encoding as the encoding schemes ('cuts' and 'x_boxes') use the same list syntax that makes alignment between glyphs and confidences from the CC attribute impossible in some cases.

Jo-CCS commented 8 years ago

The tech calls took place on 17 and 21 March.
17th, attended: Raju Buddharaju, Stefan Pletschacher, Joachim Bauer
21st, attended: Raju Buddharaju, Jean-Philippe Moreux, Jukka Kervinen, Joachim Bauer

Here is the summary of the topics discussed and the corresponding conclusions / proposals for the final changes:

1) Clarification of the element name "glyph"
Following the previous discussions and the post from "mittagessen", we discussed the wording once more. After clarifying that ALTO is not only for OCR results but for any description of the layout and text of page objects, it was concluded that glyph is the right word. ALTO supports the full range of Unicode in XML (UTF-8/16) and therefore allows the use of precomposed characters, e.g. from born-digital material. Stefan and Raju contributed a lot of input to the clarification, with useful samples. Raju provided a first small sample of the use cases for Indian and Thai languages. These will be provided as soon as possible and shared with the final release of the change, as will the other 4 sample use cases created earlier.

Accordingly, the value of the glyph element will be defined as follows: precomposed representation = base + combining character(s) (decomposed representation). See http://www.fileformat.info/info/unicode/char/0101/index.htm: "U+0101" = (U+0061) + (U+0304). "Combining characters": "base characters" in combination with non-spacing characters that are combined into one are represented as one "glyph", e.g. áàâ
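
The equivalence above can be checked with Python's standard `unicodedata` module. The sketch below shows that the decomposed pair U+0061 + U+0304 normalizes to the single precomposed character U+0101. The helper name `glyph_content` is hypothetical, and the use of NFC for glyph content is an assumption, not something the schema prescribes.

```python
import unicodedata

precomposed = "\u0101"        # LATIN SMALL LETTER A WITH MACRON
decomposed = "\u0061\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON


def glyph_content(text):
    """Compose base + combining characters so they can be stored as one glyph."""
    return unicodedata.normalize("NFC", text)


composed = glyph_content(decomposed)
print(len(decomposed), len(composed), composed == precomposed)
```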

2) File size - naming convention
First tests were performed with sample files of book pages (about A4 format) generated by docWorks. Extending these files with a glyph element for each character, each with 3 variants, increased the file size by a factor of 7. Shortening the attribute names to one or two characters reduced the increase to a factor of 6. In both calls it was confirmed that consistent naming within ALTO and human readability of the XML files are more important than file size. It was also concluded that compression will take care of the issue. So the attributes should be kept as initially proposed. It was also proposed to create and directly provide a transformation that removes the glyph elements again, so that "fast viewing" ALTO files can be generated easily, similar to fast-web images (see also next topic).
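
The transformation mentioned here was not specified further. Purely as an illustration, the same effect can be sketched in Python instead of a stylesheet. The sketch assumes un-namespaced element names as in the samples of this issue, and the function name `strip_glyphs` is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_glyphs(root):
    """Remove every Glyph child from every String element; return how many were removed."""
    removed = 0
    # Materialize the String list first so the tree is not modified while iterating it.
    for string_el in list(root.iter("String")):
        for glyph in string_el.findall("Glyph"):
            string_el.remove(glyph)
            removed += 1
    return removed


root = ET.fromstring(
    '<TextLine>'
    '<String ID="P4_ST00001" CONTENT="12" WC="0.99">'
    '<Shape/>'
    '<Glyph CONTENT="1"><Variant VC="0.2">l</Variant></Glyph>'
    '<Glyph CONTENT="2"/>'
    '</String>'
    '</TextLine>'
)
removed = strip_glyphs(root)
print(removed, ET.tostring(root, encoding="unicode"))
```

The String element keeps its CONTENT attribute and Shape, so the stripped file still carries the full word-level text.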

3) Redundancy of the CONTENT attribute of the String element with respect to the glyph elements
It was considered and confirmed that the redundancy will be kept, for two reasons: a) backwards compatibility and the ability to produce "fast viewing" ALTO files through a simple transformation that removes the glyph elements; b) the String element is more efficient for transformations that extract all text (presentation systems). The annotation will clarify that the string content should match the combination of the glyphs inside the String element.
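
A check for the rule stated in the last sentence could look like the following Python sketch. It assumes un-namespaced element names and a hypothetical helper name `content_matches_glyphs`. A String without any glyph elements is treated as consistent, since the glyph level is optional.

```python
import xml.etree.ElementTree as ET


def content_matches_glyphs(string_el):
    """True if String CONTENT equals the concatenated CONTENT of its Glyph children."""
    glyphs = string_el.findall("Glyph")
    if not glyphs:
        return True  # nothing to compare against
    joined = "".join(glyph.get("CONTENT", "") for glyph in glyphs)
    return string_el.get("CONTENT") == joined


consistent = ET.fromstring(
    '<String CONTENT="12"><Glyph CONTENT="1"/><Glyph CONTENT="2"/></String>'
)
# E.g. the word was corrected after OCR, but the glyphs were left untouched.
corrected = ET.fromstring(
    '<String CONTENT="l2"><Glyph CONTENT="1"/><Glyph CONTENT="2"/></String>'
)
print(content_matches_glyphs(consistent), content_matches_glyphs(corrected))
```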

4) Confidence values
I raised the question of whether GC on glyph level and VC on variant level is misleading, as both express the same kind of value. It was proposed to keep the naming convention and just make clear in the annotation that the values mean the same thing. The discussion about the method of calculating the confidence value, as well as the demand for multiple confidence values referenced from different engines, was excluded from this change request. The reason is that none of the other elements (string, page) support multiple references either, and furthermore the engines themselves would need to be recorded as well. For regular digitization we currently assume that only the final result of the combination of engines / processing steps is recorded. Finally, Jean-Philippe will provide a sample from a BnF internal research project in which such multi-engine confidence values were recorded, to give the best impression of everything needed for such a research use case. So it was agreed to clarify this point later in the other issues (https://github.com/altoxml/schema/issues/13, https://github.com/altoxml/schema/issues/23).

5) Special case of combined glyphs
Using the 5 samples, we discussed how to describe glyphs and their variants when a variant may consist of multiple base characters, e.g. an "m" has the variant "r" + "n", or an "n" might also be an "i" + "i". No final conclusion was reached here. The votes tended towards keeping it simple, but the full board should consider the impact and the need once more. The options proposed during the discussions were as follows:

   <!-- Option 1: keep simple and have multiple characters in variants without further information -->
    <Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.7">rn</Variant>
        <Variant VC="0.1">iii</Variant>
    </Glyph>

    <!-- Option 2: keep simple and have multiple characters in variants, but adding coordinates (optional) further more -->
    <Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <!-- multiple chars in a variant?-->
        <Variant CONTENT="rn" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9"/>
        <Variant CONTENT="iii" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9"/>
    </Glyph>

    <!-- Option 3: usage of same logic from file references in METS with seq and par to outline possible combinations -->
    <Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <!-- multiple chars in a variant?-->
        <par>
          <seq>
            <Variant VC="0.7">r</Variant>
            <Variant VC="0.7">n</Variant>
          </seq>
          <seq>
            <Variant VC="0.7">i</Variant>
            <Variant VC="0.7">i</Variant>
            <Variant VC="0.7">i</Variant>
          </seq>
          <Variant VC="0.1">n</Variant>
        </par>
    </Glyph>

    <!-- Option 4: grouping glyphs in additional level above, here  -->
    <Variant ID="var_opt1">
      <Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.7">r</Variant>
      </Glyph>
    </Variant>
    <Variant ID="var_opt2">
      <Glyph ID="P1_ST00003_G03" CONTENT="n" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.1">i</Variant>
        <Variant VC="0.1">i</Variant>
      </Glyph>
      <Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.1">l</Variant>
      </Glyph>
    </Variant>
    <Variant ID="var_opt3">
      <Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.1">l</Variant>
      </Glyph>
      <Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.1">l</Variant>
      </Glyph>
      <Glyph ID="P1_ST00003_G03" CONTENT="i" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <Variant VC="0.1">l</Variant>
      </Glyph>
    </Variant>

    <!-- Option 5: grouping glyphs in additional level above, here  -->
    <Glyph ID="P1_ST00003_G03" CONTENT="m" HPOS="253" VPOS="223" WIDTH="9" HEIGHT="24" GC="0.9">
        <!-- multiple chars in a variant?-->
        <Variant VC="0.7" VAR_TYPE="Part1" VARContent="rn">r</Variant>
        <Variant VC="0.7" VAR_TYPE="Part2" VARContent="rn">n</Variant>
    </Glyph>

Thanks to everyone who took part in the calls for the input and time. Please comment and correct if something does not match your understanding or if there are mistakes. All others are welcome to comment as well. We will also briefly wrap it up on today's call; based on that I will finalize the sample files for the different use cases and adapt the schema proposal accordingly, to come close to the final change request for review and acceptance. Regards, jo

jpmoreux commented 8 years ago

ABBYY full output for variants

<wordRecVariants>
    <wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="1" wordPenalty="0" meanStrokeWidth="40">
        <variantText>18e<charParams l="1977" t="197" r="1994" b="237" charConfidence="21" serifProbability="100">
                <charRecVariants>
                    <charRecVariant charConfidence="25" serifProbability="255">i</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">1</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">I</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">l</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">Î</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">Ï</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">î</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">ï</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">!</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">{</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">A</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">a</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">À</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">Â</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">à</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">â</charRecVariant>
                </charRecVariants>1</charParams>
            <charParams l="1977" t="197" r="1994" b="237" charConfidence="88" serifProbability="75">
                <charRecVariants>
                    <charRecVariant charConfidence="88" serifProbability="75">8</charRecVariant>
                    <charRecVariant charConfidence="19" serifProbability="73">S</charRecVariant>
                    <charRecVariant charConfidence="19" serifProbability="73">s</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="43">B</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="43">b</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="255">3</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">6</charRecVariant>
                </charRecVariants>8</charParams>
            <charParams l="1977" t="197" r="1994" b="237" charConfidence="40" serifProbability="40">
                <charRecVariants>
                    <charRecVariant charConfidence="50" serifProbability="100">6</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">e</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">è</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">é</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">ê</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">ë</charRecVariant>
                    <charRecVariant charConfidence="29" serifProbability="255">fi</charRecVariant>
                    <charRecVariant charConfidence="24" serifProbability="255">®</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="27">8</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="43">B</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="43">b</charRecVariant>
                    <charRecVariant charConfidence="14" serifProbability="32">S</charRecVariant>
                    <charRecVariant charConfidence="14" serifProbability="32">s</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">g</charRecVariant>
                </charRecVariants>e</charParams>
        </variantText>
    </wordRecVariant>
    <wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="0" wordIdentifier="0" wordPenalty="7" meanStrokeWidth="40">
        <variantText>I8e<charParams l="1977" t="197" r="1994" b="237" charConfidence="21" serifProbability="100">
                <charRecVariants>
                    <charRecVariant charConfidence="25" serifProbability="255">i</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">1</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">I</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">l</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">Î</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">Ï</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">î</charRecVariant>
                    <charRecVariant charConfidence="21" serifProbability="100">ï</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">!</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">{</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">A</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">a</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">À</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">Â</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">à</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">â</charRecVariant>
                </charRecVariants>I</charParams>
            <charParams l="1977" t="197" r="1994" b="237" charConfidence="88" serifProbability="75">
                <charRecVariants>
                    <charRecVariant charConfidence="88" serifProbability="75">8</charRecVariant>
                    <charRecVariant charConfidence="19" serifProbability="73">S</charRecVariant>
                    <charRecVariant charConfidence="19" serifProbability="73">s</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="43">B</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="43">b</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="255">3</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">6</charRecVariant>
                </charRecVariants>8</charParams>
            <charParams l="1977" t="197" r="1994" b="237" charConfidence="40" serifProbability="40">
                <charRecVariants>
                    <charRecVariant charConfidence="50" serifProbability="100">6</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">e</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">è</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">é</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">ê</charRecVariant>
                    <charRecVariant charConfidence="40" serifProbability="40">ë</charRecVariant>
                    <charRecVariant charConfidence="29" serifProbability="255">fi</charRecVariant>
                    <charRecVariant charConfidence="24" serifProbability="255">®</charRecVariant>
                    <charRecVariant charConfidence="16" serifProbability="27">8</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="43">B</charRecVariant>
                    <charRecVariant charConfidence="15" serifProbability="43">b</charRecVariant>
                    <charRecVariant charConfidence="14" serifProbability="32">S</charRecVariant>
                    <charRecVariant charConfidence="14" serifProbability="32">s</charRecVariant>
                    <charRecVariant charConfidence="13" serifProbability="255">g</charRecVariant>
                </charRecVariants>e</charParams>
        </variantText>
    </wordRecVariant>
</wordRecVariants>

jpmoreux commented 8 years ago

If we agree to focus on how to record variants for ONE OCR engine (as discussed during the meeting), case no. 5 doesn't exist. Let's suppose an engine has to segment "my" and splits the word into 2 glyphs. It will eventually output variants 'm' / 'n' / 'h' for the first glyph, but never a 2-glyph variant. For the second glyph: 'y' / 'j'. And at the word level, this engine will propose word variants: "my" / "ny" / "nj" / "ny"... Another engine will segment "my" into 3 glyphs, and maybe the variants for the first one will be: 'i' / 'r'. Etc. Consequently, if we want to address text recognition variants, we need to store word and character variants. And we don't need to keep the coordinates in these variants (because we don't address the segmentation variants issue).

Jo-CCS commented 8 years ago

The changes have been incorporated into the version 3.2 draft schema. Compared with the previously agreed change, the only difference is that the content of a Variant is now stored in an attribute instead of as the element value. Updated samples are available as well at: https://github.com/altoxml/documentation/tree/master/v3/Glyph

I ask everyone to review and accept / reject the change.
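
For samples written against the earlier drafts in this issue, moving the variant text into an attribute could be sketched as below. The attribute name CONTENT is an assumption by analogy with the Glyph element (please check the draft schema for the actual name), and the function name `variant_text_to_attribute` is hypothetical.

```python
import xml.etree.ElementTree as ET


def variant_text_to_attribute(glyph, attribute="CONTENT"):
    """Move each Variant's element text into an attribute; return how many were changed."""
    changed = 0
    for variant in glyph.findall("Variant"):
        # Skip variants that have no text or already carry the attribute.
        if variant.text and attribute not in variant.attrib:
            variant.set(attribute, variant.text)
            variant.text = None
            changed += 1
    return changed


glyph = ET.fromstring(
    '<Glyph CONTENT="1" GC="0.9">'
    '<Variant VC="0.2">l</Variant>'
    '<Variant VC="0.1">i</Variant>'
    '</Glyph>'
)
changed = variant_text_to_attribute(glyph)
print(changed, ET.tostring(glyph, encoding="unicode"))
```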

splet commented 7 years ago

In principle okay but our main developer had the following comments:

Glyph variants: The main glyphs are restricted to length 1 but variants to length 3. This could be a bit inconvenient when dealing with OCR results. Say FineReader returns 5 options, some with length 1 and some longer. What happens if the first one is not of length 1, does the ALTO exporter tool then check if there is one with length 1 among the other options and change the order? And why three? For Latin that would probably cover most cases, but for other scripts there might be longer ones.

HYP: Should variants also be considered for hyphens?

cowboyMontana commented 7 years ago

ACCEPT

altomator commented 7 years ago

ACCEPT

cneud commented 7 years ago

ACCEPT

jukervin commented 7 years ago

ACCEPT

libmanuk commented 7 years ago

ACCEPT

rajubln commented 7 years ago

ACCEPT

bkgeig commented 7 years ago

accept

ntra00 commented 7 years ago

accept

splet commented 7 years ago

Accept (comment 27 Oct 2016 to be raised as a new issue)

Jo-CCS commented 7 years ago

Stephan, I thought we had already discussed this on the last call, but as that was so long ago I am no longer sure either. Sorry for not having updated here. I will reply on the individual items.

cneud commented 7 years ago

Is there an XSD with the proposed glyph changes anywhere here? I could not find one. I would like to use this as the basis to edit in the proposed changes in #39 for v4.0.

EDIT: My bad, should have checked branches first...found it!

cneud commented 6 years ago

Included in v4.0.

altoxml / schema

Glyphs (IMPACT) #26