altoxml / schema

ALTO XML schema - latest and all former versions

Asian language (Japanese, Korean, Chinese) with ruby characters cannot be properly represented in ALTO. #51

Closed cowboyMontana closed 4 years ago

mittagessen commented 6 years ago

Aren't those already properly dealt with by Unicode interlinear annotation code points or standoff annotation? Explicit semantic markup isn't part of ALTO right now and introducing it might be a slippery slope into TEI territory.
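
For reference, here is a minimal Python sketch of what the Unicode route would look like if the annotation were stored inline in a CONTENT value. It uses the three interlinear annotation characters U+FFF9 to U+FFFB. The helper names `annotate` and `strip_annotations` are made up for illustration and are not part of ALTO or of any OCR engine.

```python
import unicodedata

# Unicode interlinear annotation characters (U+FFF9..U+FFFB).
ANCHOR = "\ufff9"      # precedes the annotated (base) text
SEPARATOR = "\ufffa"   # precedes the annotation (ruby) text
TERMINATOR = "\ufffb"  # ends the annotation


def annotate(base: str, ruby: str) -> str:
    """Wrap base text and its ruby reading in interlinear annotation characters."""
    return f"{ANCHOR}{base}{SEPARATOR}{ruby}{TERMINATOR}"


def strip_annotations(text: str) -> str:
    """Drop annotation text and the control characters, keeping only base text."""
    out = []
    in_annotation = False
    for ch in text:
        if ch == ANCHOR:
            continue
        if ch == SEPARATOR:
            in_annotation = True
        elif ch == TERMINATOR:
            in_annotation = False
        elif not in_annotation:
            out.append(ch)
    return "".join(out)


content = annotate("攻殻", "こうかく") + annotate("機動隊", "きどうたい")

# Show the code points involved, then the text with annotations removed.
for ch in (ANCHOR, SEPARATOR, TERMINATOR):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(strip_annotations(content))
```

A consumer that does not understand these characters has to strip them as above, which is one reason standoff annotation may be the more practical of the two options.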

artunit commented 6 years ago

ALTO has had a NamedEntityTag since 2014, so there is some precedent for semantics now. I wonder if the main challenge is that ALTO attempts to support the reconstruction of the original appearance of the object, and that requires some way to render interlinear annotation symbols. My understanding is that Unicode's provision is rarely implemented and is not intended for markup languages. The characters can be mapped out with ALTO's glyph support, but that introduces other problems. I have a few examples I will add to this issue.

artunit commented 6 years ago

Here is an example for "Ghost in the Shell":

[Image: ruby]

Glyphs are one option, where the text (攻殻機動隊) is in the String element, but the extra glyphs are identified separately.

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns="http://www.loc.gov/standards/alto/ns-v4#" 
  xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-0.xsd" 
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <Styles>
    <!-- taking these from tesseract -->
    <TextStyle ID="font0" FONTFAMILY="Noto_Sans_Japanese_Light" FONTSIZE="16" />
  </Styles>
  <Tags>
    <LayoutTag ID="L01" TYPE="Includes ruby glyphs" />
  </Tags>
  <Layout>
    <!-- in this example, page specs include ruby characters, string metrics do not -->
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="179" WIDTH="499">
      <PrintSpace ID="P1_PS00001" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="158">
        <TextBlock ID="P1_TB00001" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="158" STYLEREFS="font0">
          <TextLine ID="P1_TL00001" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="158">
            <String ID="P1_S00001" TAGREFS="L01" CONTENT="攻殻機動隊" WC="95.153336" BOLD="0" ITALIC="0" UNDERLINED="0" MONOSPACE="0" SERIF="0" SMALLCAPS="0">
              <!-- start of ruby characters -->
              <Glyph CONTENT="こ" HPOS="9" VPOS="19" WIDTH="30" HEIGHT="33" GC="82.728165">
                <Variant VC="77.799820">二</Variant>
                <Variant VC="74.297661">ご</Variant>
                <Variant VC="69.200447">乙</Variant>
              </Glyph>
              <Glyph CONTENT="う" HPOS="61" VPOS="12" WIDTH="24" HEIGHT="42" GC="78.213593">
                <Variant VC="74.154175">ぅ</Variant>
              </Glyph>
              <!-- tesseract really doesn't come close on this, should be か -->
              <Glyph CONTENT="菫" HPOS="105" VPOS="14" WIDTH="46" HEIGHT="86" GC="62.827019">
                <Variant VC="61.445938">董</Variant>
                <Variant VC="60.676205">童</Variant>
                <Variant VC="60.309708">茎</Variant>
                <Variant VC="60.085224">重</Variant>
                <Variant VC="59.154743">葦</Variant>
                <Variant VC="59.120598">革</Variant>
                <Variant VC="58.844460">窒</Variant>
                <Variant VC="58.631226">草</Variant>
                <Variant VC="58.597076">至</Variant>
              </Glyph>
              <Glyph CONTENT="〈" HPOS="162" VPOS="13" WIDTH="22" HEIGHT="40" GC="91.775093">
                <Variant VC="79.057526">く</Variant>
              </Glyph>
              <Glyph CONTENT="き" HPOS="214" VPOS="13" WIDTH="29" HEIGHT="40" GC="82.332146">
                <Variant VC="69.114243">春</Variant>
                <Variant VC="69.082031">吉</Variant>
                <Variant VC="67.942940">善</Variant>
              </Glyph>
              <Glyph CONTENT="ど" HPOS="274" VPOS="11" WIDTH="37" HEIGHT="41" GC="81.790161" />
              <Glyph CONTENT="う" HPOS="336" VPOS="12" WIDTH="24" HEIGHT="42" GC="78.052055">
                <Variant VC="73.923515">ぅ</Variant>
              </Glyph>
              <Glyph CONTENT="た" HPOS="389" VPOS="12" WIDTH="39" HEIGHT="40" GC="81.743462" />
              <Glyph CONTENT="い" HPOS="452" VPOS="18" WIDTH="37" HEIGHT="32" GC="87.747452">
                <Variant VC="77.132919">ぃ</Variant>
              </Glyph>
              <!-- start of string proper -->
              <Glyph CONTENT="攻" HPOS="4" VPOS="70" WIDTH="140" HEIGHT="92" GC="99.307617" />
              <Glyph CONTENT="殻" HPOS="138" VPOS="70" WIDTH="96" HEIGHT="91" GC="99.553749" />
              <Glyph CONTENT="機" HPOS="226" VPOS="71" WIDTH="109" HEIGHT="90" GC="99.265053" />
              <Glyph CONTENT="動" HPOS="343" VPOS="73" WIDTH="94" HEIGHT="89" GC="99.232162" />
              <Glyph CONTENT="隊" HPOS="429" VPOS="73" WIDTH="66" HEIGHT="88" GC="99.152809" />
            </String>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>

HTML has a ruby element. By default, browsers place ruby characters over the string, which is common, but ruby characters may also be below and beside strings (especially when the text is written vertically).

  <ruby>
    攻殻
    <rt>こうかく</rt>
    機動隊
    <rt>きどうたい</rt>
  </ruby>

The key is probably to find a way to identify ruby characters, perhaps an attribute on TextLine, String or Glyph?
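
Until there is such an attribute, the only way to tell the two kinds of glyph apart in the Glyph example is geometry. Here is a rough Python sketch of that, assuming horizontal text with the ruby above the base characters. The sample is a trimmed copy of the String from the example (Variant children and most attributes dropped), and `split_ruby` and its midpoint cutoff are my own illustration, not anything ALTO defines.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Trimmed copy of the String from the Glyph example.
SAMPLE = """<String xmlns="http://www.loc.gov/standards/alto/ns-v4#" CONTENT="攻殻機動隊">
  <Glyph CONTENT="こ" HPOS="9" VPOS="19" WIDTH="30" HEIGHT="33"/>
  <Glyph CONTENT="う" HPOS="61" VPOS="12" WIDTH="24" HEIGHT="42"/>
  <Glyph CONTENT="菫" HPOS="105" VPOS="14" WIDTH="46" HEIGHT="86"/>
  <Glyph CONTENT="〈" HPOS="162" VPOS="13" WIDTH="22" HEIGHT="40"/>
  <Glyph CONTENT="き" HPOS="214" VPOS="13" WIDTH="29" HEIGHT="40"/>
  <Glyph CONTENT="ど" HPOS="274" VPOS="11" WIDTH="37" HEIGHT="41"/>
  <Glyph CONTENT="う" HPOS="336" VPOS="12" WIDTH="24" HEIGHT="42"/>
  <Glyph CONTENT="た" HPOS="389" VPOS="12" WIDTH="39" HEIGHT="40"/>
  <Glyph CONTENT="い" HPOS="452" VPOS="18" WIDTH="37" HEIGHT="32"/>
  <Glyph CONTENT="攻" HPOS="4" VPOS="70" WIDTH="140" HEIGHT="92"/>
  <Glyph CONTENT="殻" HPOS="138" VPOS="70" WIDTH="96" HEIGHT="91"/>
  <Glyph CONTENT="機" HPOS="226" VPOS="71" WIDTH="109" HEIGHT="90"/>
  <Glyph CONTENT="動" HPOS="343" VPOS="73" WIDTH="94" HEIGHT="89"/>
  <Glyph CONTENT="隊" HPOS="429" VPOS="73" WIDTH="66" HEIGHT="88"/>
</String>"""


def split_ruby(string_el):
    """Split Glyph children into (ruby, base) text by vertical position.

    Assumption: glyphs whose top edge lies above the midpoint between the
    highest and lowest top edges are ruby. This only holds for horizontal
    text with ruby placed above the base characters.
    """
    glyphs = string_el.findall("a:Glyph", NS)
    if not glyphs:
        return "", ""
    tops = [float(g.get("VPOS")) for g in glyphs]
    cutoff = (min(tops) + max(tops)) / 2
    ruby = [g.get("CONTENT") for g, top in zip(glyphs, tops) if top < cutoff]
    base = [g.get("CONTENT") for g, top in zip(glyphs, tops) if top >= cutoff]
    return "".join(ruby), "".join(base)


ruby, base = split_ruby(ET.fromstring(SAMPLE))
print(ruby)
print(base)
```

This breaks down for vertical text or ruby below the line, which is part of why an explicit marker seems preferable to inferring the role from coordinates.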

mittagessen commented 6 years ago

Wikipedia claims Unicode ruby is not intended for markup languages, but the recommendation has been withdrawn by the consortium. I agree it wouldn't be the cleanest way to encode it, especially as any OCR engine I know of would recognize it as two separate lines, making the feature non-implementable for all practical purposes.

A RoleTag would allow simple markup of the ruby role, but linking multiple elements still isn't provided by any ALTO facility.

artunit commented 6 years ago

Yes, I think difficult encodings would get messy very quickly. For example, I don't think it's uncommon in manga to have furigana dots beside ruby characters.

[Image: ruby/furigana]

I don't think my approach is the right one. Glyphs can technically capture text that is not typographically part of String while keeping the parts of the text together, but the schema currently states that the "glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence." The other ripple is that ruby characters are usually in a smaller and sometimes different font. OCR engines would approach the text as distinct lines, which can make for nonsensical lines, but it would allow for font variations. I think linking TextLine elements somehow would be the better solution.

artunit commented 6 years ago

This might be an opportunity to use RoleTag, which is available for "criteria about function or mission":

......
</Styles>
<Tags>
   <RoleTag ID="IL01" TYPE="Interlinear"/>
</Tags>
.......
<TextBlock>
    <TextLine TAGREFS="IL01"> ... </TextLine>
    <TextLine> ... </TextLine>
</TextBlock>
.....

This would not require changing the schema and could also be used for other interlinear text scenarios, such as an interlinear gloss.
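
As a rough illustration of the producer side, here is a Python sketch of a post-processing step that adds the RoleTag and the TAGREFS reference to an existing file. The input document is a made-up minimal skeleton (not schema-valid, most required attributes omitted), and `tag_interlinear`, the ID `IL01` and the choice of which line to tag are all assumptions for the example.

```python
import xml.etree.ElementTree as ET

ALTO = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"a": ALTO}
ET.register_namespace("", ALTO)  # serialize with a default namespace, no prefix

# Hypothetical skeleton: two lines, the first being the ruby reading.
DOC = f"""<alto xmlns="{ALTO}">
  <Styles/>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1"><String CONTENT="こうかくきどうたい"/></TextLine>
          <TextLine ID="TL2"><String CONTENT="攻殻機動隊"/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def tag_interlinear(root, line_ids, tag_id="IL01"):
    """Add an 'Interlinear' RoleTag and reference it from the given TextLines."""
    tags = root.find("a:Tags", NS)
    if tags is None:
        # Tags has to come before Layout in an ALTO document.
        tags = ET.Element(f"{{{ALTO}}}Tags")
        layout = root.find("a:Layout", NS)
        root.insert(list(root).index(layout), tags)
    if not any(t.get("ID") == tag_id for t in tags):
        ET.SubElement(tags, f"{{{ALTO}}}RoleTag",
                      {"ID": tag_id, "TYPE": "Interlinear"})
    for line in root.iter(f"{{{ALTO}}}TextLine"):
        if line.get("ID") in line_ids:
            # TAGREFS is a space-separated list of IDs; keep existing entries.
            refs = (line.get("TAGREFS") or "").split()
            if tag_id not in refs:
                refs.append(tag_id)
            line.set("TAGREFS", " ".join(refs))
    return root


root = tag_interlinear(ET.fromstring(DOC), {"TL1"})
print(ET.tostring(root, encoding="unicode"))
```

Deciding which lines are interlinear is the hard part and is left out here; the sketch only shows that the markup itself fits into the existing Tags and TAGREFS mechanism.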

artunit commented 5 years ago

Just to close the loop on this, here is the same example without the glyph element.

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns="http://www.loc.gov/standards/alto/ns-v4#" 
  xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-0.xsd" 
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <Styles>
    <!-- taking these from tesseract -->
    <TextStyle ID="font0" FONTFAMILY="Noto_Sans_Japanese_Light" FONTSIZE="16" />
  </Styles>
  <Tags>
    <RoleTag ID="IL01" TYPE="Interlinear"/>
  </Tags>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="179" WIDTH="499">
      <PrintSpace ID="P1_PS00001" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="158">
        <TextBlock ID="P1_TB00001" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="158" STYLEREFS="font0">
          <TextLine ID="P1_TL00001" TAGREFS="IL01" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="58">
            <!-- ruby string -->
            <String ID="P1_S00001" CONTENT="こうか〈きどたい" WC="94.254443" BOLD="0" ITALIC="0" UNDERLINED="0" MONOSPACE="0" SERIF="0" SMALLCAPS="0"/>
          </TextLine>
          <TextLine ID="P1_TL00002" HPOS="9" VPOS="11" WIDTH="486" HEIGHT="96">
            <!-- start of string proper -->
            <String ID="P1_S00002" CONTENT="攻殻機動隊" WC="94.254443" BOLD="0" ITALIC="0" UNDERLINED="0" MONOSPACE="0" SERIF="0" SMALLCAPS="0"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
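
To tie this back to the HTML ruby element, here is a rough Python sketch of how a consumer might turn the two lines into `<ruby>` markup. It rests on one big assumption that ALTO itself does not express: a line tagged with the Interlinear role annotates the next untagged TextLine in the same TextBlock. The embedded document is a cut-down copy of the example above, and `line_text` and `block_to_html` are names invented for the sketch.

```python
import xml.etree.ElementTree as ET
from html import escape

ALTO = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"a": ALTO}

# Cut-down copy of the example above (coordinates and styles dropped).
DOC = f"""<alto xmlns="{ALTO}">
  <Tags><RoleTag ID="IL01" TYPE="Interlinear"/></Tags>
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="P1_TB00001">
      <TextLine ID="P1_TL00001" TAGREFS="IL01">
        <String ID="P1_S00001" CONTENT="こうか〈きどたい"/>
      </TextLine>
      <TextLine ID="P1_TL00002">
        <String ID="P1_S00002" CONTENT="攻殻機動隊"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def line_text(line):
    """Join the CONTENT of a TextLine's String children with spaces."""
    return " ".join(s.get("CONTENT", "") for s in line.findall("a:String", NS))


def block_to_html(block, role_ids):
    """Render a TextBlock, pairing each tagged line with the line after it."""
    parts = []
    pending = None  # ruby text still waiting for its base line
    for line in block.findall("a:TextLine", NS):
        text = escape(line_text(line))
        refs = set((line.get("TAGREFS") or "").split())
        if refs & role_ids:
            pending = text
        elif pending is not None:
            parts.append(f"<ruby>{text}<rt>{pending}</rt></ruby>")
            pending = None
        else:
            parts.append(text)
    if pending is not None:
        parts.append(pending)  # tagged line with nothing after it: emit as-is
    return "\n".join(parts)


root = ET.fromstring(DOC)
role_ids = {t.get("ID") for t in root.findall("a:Tags/a:RoleTag", NS)
            if t.get("TYPE") == "Interlinear"}
rendered = [block_to_html(b, role_ids) for b in root.iter(f"{{{ALTO}}}TextBlock")]
print("\n".join(rendered))
```

The pairing rule is the weak point. When the annotation sits below its base line, the order is reversed and this sketch would pair it with the wrong line, so some explicit link between the lines would still be needed.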

This approach would also be an option for interlinear text.

[Image: ruby]

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns="http://www.loc.gov/standards/alto/ns-v4#" 
  xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-0.xsd" 
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <Tags>
    <RoleTag ID="IL01" TYPE="Interlinear"/>
  </Tags>
  <Layout>
    <Page WIDTH="455" HEIGHT="56" PHYSICAL_IMG_NR="0" ID="page_0">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="455" HEIGHT="56">
        <TextBlock ID="block_0" HPOS="6" VPOS="12" WIDTH="441" HEIGHT="34">
          <TextLine ID="line_0" HPOS="6" VPOS="12" WIDTH="385" HEIGHT="15">
            <String ID="string_0" HPOS="6" VPOS="12" WIDTH="16" HEIGHT="12" WC="0.67" CONTENT="35"/>
            <String ID="string_1" HPOS="65" VPOS="12" WIDTH="26" HEIGHT="12" WC="0.64" CONTENT="But"/>
            <String ID="string_2" HPOS="96" VPOS="12" WIDTH="78" HEIGHT="15" WC="0.64" CONTENT="nathelees,"/>
            <String ID="string_3" HPOS="180" VPOS="12" WIDTH="30" HEIGHT="12" WC="0.91" CONTENT="whil"/>
            <String ID="string_4" HPOS="217" VPOS="12" WIDTH="2" HEIGHT="12" WC="0.74" CONTENT="|"/>
            <String ID="string_5" HPOS="226" VPOS="12" WIDTH="35" HEIGHT="12" WC="0.71" CONTENT="have"/>
            <String ID="string_6" HPOS="266" VPOS="12" WIDTH="37" HEIGHT="15" WC="0.71" CONTENT="tyme"/>
            <String ID="string_7" HPOS="309" VPOS="12" WIDTH="27" HEIGHT="12" WC="0.91" CONTENT="and"/>
            <String ID="string_8" HPOS="343" VPOS="15" WIDTH="48" HEIGHT="12" WC="0.81" CONTENT="space,"/>
          </TextLine>
          <TextLine ID="line_1" TAGREFS="IL01" HPOS="83" VPOS="31" WIDTH="364" HEIGHT="15">
            <String ID="string_9" HPOS="83" VPOS="31" WIDTH="23" HEIGHT="12" WC="0.92" CONTENT="But"/>
            <String ID="string_10" HPOS="112" VPOS="31" WIDTH="86" HEIGHT="12" WC="0.66" CONTENT="nonetheless,"/>
            <String ID="string_11" HPOS="208" VPOS="31" WIDTH="36" HEIGHT="12" WC="0.92" CONTENT="while"/>
            <String ID="string_12" HPOS="250" VPOS="31" WIDTH="2" HEIGHT="12" WC="0.56" CONTENT="I"/>
            <String ID="string_13" HPOS="259" VPOS="31" WIDTH="33" HEIGHT="12" WC="0.92" CONTENT="have"/>
            <String ID="string_14" HPOS="298" VPOS="31" WIDTH="30" HEIGHT="12" WC="0.84" CONTENT="time"/>
            <String ID="string_15" HPOS="334" VPOS="31" WIDTH="25" HEIGHT="12" WC="0.84" CONTENT="and"/>
            <String ID="string_16" HPOS="365" VPOS="31" WIDTH="82" HEIGHT="15" WC="0.48" CONTENT="opportunity,"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>

artunit commented 5 years ago

Whoops, didn't mean to close!

artunit commented 4 years ago

In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested.