TEI4HTR / page2tei

A repository for illustrating the transformation of a PAGE XML file into XML-TEI format, resulting from experimentations made for the LECTAUREP project.
Creative Commons Attribution 4.0 International

Which transformation for PAGE XML elements? xmlpage_to_tei.xsl v2 documentation #2

HugoSchtr closed this issue 2 years ago

HugoSchtr commented 2 years ago

In the second version of the XSL, the transformation from PAGE XML to TEI proceeds as follows:

For metadata:

  <Metadata>
    <Creator>escriptorium</Creator>
    <Created>2021-10-07T07:46:39.064183+00:00</Created>
    <LastChange>2021-10-07T07:46:39.064229+00:00</LastChange>
  </Metadata>

becomes:

   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>FRAN_0025_3056_L-0</title>
            <respStmt>
               <resp>Transcribed with</resp>
               <name>escriptorium</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p/>
         </publicationStmt>
         <sourceDesc>
            <p/>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change when="2021-10-07T07:46:39.064183+00:00">Creation</change>
         <change when="2021-10-07T07:46:39.064229+00:00">Last change</change>
      </revisionDesc>
   </teiHeader>
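
The metadata mapping above can be sketched outside XSLT as well. The following is only an illustration in Python with the standard library, not the project's stylesheet: the helper name `metadata_to_tei_header` is hypothetical, the title is passed in as a parameter because the `<Metadata>` element does not carry it, and the PAGE namespace is omitted as in the snippets of this issue.

```python
import xml.etree.ElementTree as ET

PAGE_METADATA = """
<Metadata>
  <Creator>escriptorium</Creator>
  <Created>2021-10-07T07:46:39.064183+00:00</Created>
  <LastChange>2021-10-07T07:46:39.064229+00:00</LastChange>
</Metadata>
""".strip()


def metadata_to_tei_header(metadata, title):
    """Build a <teiHeader> from a PAGE <Metadata> element (hypothetical helper)."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # <Creator> ends up as the <name> of a <respStmt>.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    resp_stmt = ET.SubElement(title_stmt, "respStmt")
    ET.SubElement(resp_stmt, "resp").text = "Transcribed with"
    ET.SubElement(resp_stmt, "name").text = metadata.findtext("Creator")

    # Empty placeholders, as in the TEI output shown above.
    ET.SubElement(ET.SubElement(file_desc, "publicationStmt"), "p")
    ET.SubElement(ET.SubElement(file_desc, "sourceDesc"), "p")

    # <Created> and <LastChange> each become a dated <change>.
    revision = ET.SubElement(header, "revisionDesc")
    for tag, label in (("Created", "Creation"), ("LastChange", "Last change")):
        change = ET.SubElement(revision, "change", when=metadata.findtext(tag))
        change.text = label
    return header


header = metadata_to_tei_header(ET.fromstring(PAGE_METADATA), "FRAN_0025_3056_L-0")
print(ET.tostring(header, encoding="unicode"))
```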

For the transcription itself:

  <Page imageFilename="FRAN_0025_3056_L-0.jpg" imageWidth="2894" imageHeight="4393">
...

becomes:

<sourceDoc>
      <graphic url="FRAN_0025_3056_L-0.jpg" source="" width="2894px" height="4393px"/>
...
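
As a rough illustration of this attribute mapping (a Python sketch with the standard library, not the project's XSLT; the helper name `page_to_graphic` is hypothetical, and the `px` suffix and empty `source` attribute are copied from the output above):

```python
import xml.etree.ElementTree as ET


def page_to_graphic(page):
    """Map the attributes of a PAGE <Page> onto a TEI <graphic> (hypothetical helper)."""
    return ET.Element("graphic", {
        "url": page.get("imageFilename"),
        "source": "",  # left empty, as in the output above
        "width": page.get("imageWidth") + "px",
        "height": page.get("imageHeight") + "px",
    })


page = ET.fromstring(
    '<Page imageFilename="FRAN_0025_3056_L-0.jpg" '
    'imageWidth="2894" imageHeight="4393"/>'
)
source_doc = ET.Element("sourceDoc")
graphic = page_to_graphic(page)
source_doc.append(graphic)
print(ET.tostring(source_doc, encoding="unicode"))
```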

For every <TextRegion> and every text line (masks and baselines):

    <TextRegion id="eSc_textblock_afbab800"  custom="structure {type:col_1;}">
      <Coords points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615"/>

      <TextLine id="eSc_line_86b00a8e" >
        <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
        <Baseline points="289,841 389,845"/>
        <TextEquiv>
          <Unicode>198</Unicode>
        </TextEquiv>
      </TextLine>
...

becomes:

<surfaceGrp xml:id="eSc_textblock_afbab800" type="structure_{type:col_1;}">
         <surface>
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <line type="baseline" points="289,841 389,845">198</line>
            </zone>
...

HugoSchtr commented 2 years ago

However, as stated in issue #1, the points attribute requires at least three x,y pairs, and a baseline such as "289,841 389,845" has only two, so the output is currently not TEI-compliant.
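
To make the constraint concrete, here is a small Python check, offered only as a sketch: the function names are hypothetical, and the minimum of three pairs is taken from the statement above rather than from any validation code in this repository.

```python
def count_point_pairs(points):
    """Count the whitespace-separated x,y pairs in a points attribute value."""
    return len(points.split())


def meets_minimum(points, minimum=3):
    """Return True if the value has at least `minimum` x,y pairs."""
    return count_point_pairs(points) >= minimum


baseline = "289,841 389,845"
mask = "285,838 293,812 322,798 380,801 377,863 289,874"

print(meets_minimum(baseline))  # False: only two pairs
print(meets_minimum(mask))      # True: six pairs
```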

HugoSchtr commented 2 years ago

Issue #1 is resolved. Here is the new page2tei transformation for a baseline:

<TextRegion id="eSc_textblock_afbab800"  custom="structure {type:col_1;}">
      <Coords points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615"/>
      <TextLine id="eSc_line_86b00a8e" >
        <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
        <Baseline points="289,841 389,845"/>
        <TextEquiv>
          <Unicode>198</Unicode>
        </TextEquiv>
      </TextLine>
      <TextLine id="eSc_line_4218ebcd" >
        <Coords points="278,981 285,940 311,929 380,948 384,992 359,1028 318,1028 282,1006"/>
        <Baseline points="278,981 384,992"/>
        <TextEquiv>
          <Unicode>199</Unicode>
        </TextEquiv>
      </TextLine>
       ...

becomes:

<surfaceGrp xml:id="eSc_textblock_afbab800" type="structure_{type:col_1;}">
         <surface points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <path type="baseline" points="289,841 389,845"/>
               <line>198</line>
            </zone>
            <zone xml:id="eSc_line_4218ebcd"
                  type="mask"
                  points="278,981 285,940 311,929 380,948 384,992 359,1028 318,1028 282,1006">
               <path type="baseline" points="278,981 384,992"/>
               <line>199</line>
            </zone>
            ...

The new version of the transformation also carries each region's coordinates over from the PAGE XML to the TEI, in the points attribute of the <surface> element.
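
The revised structure can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the project's XSLT: the helper name `region_to_surface_grp` is hypothetical, the PAGE namespace is omitted as in the snippets above, and the `type` value is derived by replacing spaces with underscores, which is inferred from the example output only.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for xml:id; ElementTree serializes it with the xml prefix.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

PAGE_REGION = """
<TextRegion id="eSc_textblock_afbab800" custom="structure {type:col_1;}">
  <Coords points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615"/>
  <TextLine id="eSc_line_86b00a8e">
    <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
    <Baseline points="289,841 389,845"/>
    <TextEquiv><Unicode>198</Unicode></TextEquiv>
  </TextLine>
  <TextLine id="eSc_line_4218ebcd">
    <Coords points="278,981 285,940 311,929 380,948 384,992 359,1028 318,1028 282,1006"/>
    <Baseline points="278,981 384,992"/>
    <TextEquiv><Unicode>199</Unicode></TextEquiv>
  </TextLine>
</TextRegion>
""".strip()


def region_to_surface_grp(region):
    """Turn a PAGE <TextRegion> into a TEI <surfaceGrp> (hypothetical helper)."""
    grp = ET.Element("surfaceGrp", {
        XML_ID: region.get("id"),
        # Assumption: spaces become underscores, as in the example output.
        "type": region.get("custom").replace(" ", "_"),
    })
    # The region polygon goes on <surface>.
    surface = ET.SubElement(grp, "surface", points=region.find("Coords").get("points"))
    for text_line in region.findall("TextLine"):
        # The line mask becomes a <zone>.
        zone = ET.SubElement(surface, "zone", {
            XML_ID: text_line.get("id"),
            "type": "mask",
            "points": text_line.find("Coords").get("points"),
        })
        # The two-point baseline is held by <path>; the text goes in <line>.
        ET.SubElement(zone, "path", type="baseline",
                      points=text_line.find("Baseline").get("points"))
        ET.SubElement(zone, "line").text = text_line.findtext("TextEquiv/Unicode")
    return grp


grp = region_to_surface_grp(ET.fromstring(PAGE_REGION))
print(ET.tostring(grp, encoding="unicode"))
```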