altoxml / schema

ALTO XML schema - latest and all former versions

Process Result tracking (IMPACT) #27

Closed Jo-CCS closed 8 years ago

Jo-CCS commented 10 years ago

Champion: Clemens Neudecker
Submitter: Impact
Submitted: 2013-02
Status: discussion


submitted - initial status when proposal is submitted

discussion - proposal is being discussed within the board

review - xsd code is being reviewed

accepted - proposal is accepted

rejected - proposal is rejected

draft - accepted proposal is in public commenting period

published - proposal is published in a schema version

Backwards compatible ?? To ALTO version ?

Purpose: Many software tools and human interactions are involved in the different steps of the digitisation process. Each of them may affect an ALTO file by making refinements or corrections. From our point of view it would be desirable to keep track of the changes and verifications made by the different agents involved in the digitisation process. This would allow a simple kind of document history and would also give important information about the trustworthiness of the whole document. If, for example, everything was verified by a service provider, then we can assume that the quality of the document is very high. Storing the old values as well as the new ones would increase the file size tremendously, so only the information about what has been changed and by whom is stored, without keeping track of the changed values.

Correction and Validation are possible outcomes of the same process.

Implementation: The ALTO schema already defines a postProcessingStep element. The intention of this element is to record any details about those process steps that were carried out after the creation of the full text. The element is optional and not part of the actual page's definition in ALTO.

In order to store information about the correction and verification process for individual text lines, words etc., the following elements are added to the postProcessingStep section:

• processingStepType stores the type of process step. It is a free-text field, though IMPACT-internal constraints require the element's value to be set to "correction".
• processingResult groups all elements regarding the result of the process. The element's value attribute contains information about the outcome of the process. The element is repeatable. Each processingResult element represents a specific outcome of the process that is recorded in the element's value attribute. This attribute may only contain two values: "corrected" or "verified".
• processedElements is an element that wraps around all pe elements that were processed with the result stated in the processingResult element's value attribute.
• pe elements contain the ID value of an individual text line or word element. Unprocessed elements are not listed here: if an element has not been processed, its ID is not listed within processedElements.

Example:

<postProcessingStep ID="0003">      
  <processingDateTime>2012-05-26T09:34:00+02:00</processingDateTime>      
  <processingAgency>ACME Agency</processingAgency>     
  <processingStepDescription>Proofreading</processingStepDescription>     
  <processingStepSettings>Double keying required</processingStepSettings>     
  <processingSoftware>
   <softwareCreator>ACME Software Corp.</softwareCreator>           
   <softwareName>Proofer</softwareName>
   <softwareVersion>12.1</softwareVersion>
   <applicationDescription>Distributed proofreading software</applicationDescription>     
  </processingSoftware>
  <processingResult value="Proof reading performed">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
  <processingResult value="Uncorrected">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
</postProcessingStep>
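
A consumer of such a step would typically want to know, per element ID, which outcomes it is listed under. The following is only a minimal sketch in Python using the standard library's xml.etree.ElementTree. The function name outcomes_by_element and the shortened sample are illustrative assumptions, not part of the proposal, and the sample omits the XML namespace that a real ALTO file declares.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, namespace-free sample modelled on the example above (illustrative only).
STEP_SAMPLE = """
<postProcessingStep ID="0003">
  <processingStepDescription>Proofreading</processingStepDescription>
  <processingResult value="corrected">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
  <processingResult value="verified">
    <processedElements>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
</postProcessingStep>
"""


def outcomes_by_element(step_xml):
    """Map each referenced element ID to the list of outcomes it appears under."""
    step = ET.fromstring(step_xml.strip())
    outcomes = defaultdict(list)
    # Every processingResult carries one outcome in its value attribute.
    for result in step.findall("processingResult"):
        value = result.get("value", "")
        # Each pe below it holds the ID of one processed element.
        for pe in result.findall("processedElements/pe"):
            outcomes[(pe.text or "").strip()].append(value)
    return dict(outcomes)
```

Calling outcomes_by_element(STEP_SAMPLE) yields a plain dictionary keyed by element ID. Elements that were never processed simply do not appear in it, which matches the rule that unprocessed elements are not listed.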

Schema changes draft

Current schema:

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation> 
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organizationlevel producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string"  minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.
        </xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>    
  </xsd:sequence>
</xsd:complexType>

Changed schema:

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingStepType" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Type of processing step</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organizationlevel producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.
        </xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>
    <xsd:element name="processingResult" type="processingResultType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="processingResultType">
  <xsd:annotation>
    <xsd:documentation>List of processed elements.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processedElements" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>ID of processed element</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="pe" type="xsd:IDREF" minOccurs="1" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
  <xsd:attribute name="value" type="xsd:string"/>
</xsd:complexType>
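
Because pe is declared as xsd:IDREF, a validating parser should reject a pe whose value matches no xsd:ID in the same document. Where no schema validator is at hand, the check can be approximated in Python. This is only a sketch: it assumes that every attribute literally named ID acts as an identifier, it ignores namespaces, and the function name dangling_pe_refs and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample with one valid and one dangling reference (illustrative only).
IDREF_SAMPLE = """
<alto>
  <postProcessingStep ID="0003">
    <processingResult value="corrected">
      <processedElements>
        <pe>P4_TB00003</pe>
        <pe>P4_ST00099</pe>
      </processedElements>
    </processingResult>
  </postProcessingStep>
  <TextBlock ID="P4_TB00003">
    <TextLine ID="P4_TL00001"/>
  </TextBlock>
</alto>
"""


def dangling_pe_refs(root):
    """Return the pe values that match no ID attribute anywhere in the document."""
    # Collect every ID attribute value in the tree.
    known_ids = {el.get("ID") for el in root.iter() if el.get("ID") is not None}
    # Keep only those pe values that do not point at a known ID.
    return [
        (pe.text or "").strip()
        for pe in root.iter("pe")
        if (pe.text or "").strip() not in known_ids
    ]
```

An empty list means every pe points at an existing element. With the sample above, only P4_ST00099 is reported.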

cneud commented 8 years ago

Reviewing the original change request filed by the IMPACT project, it seems that two changes are requested:

  1. Add an attribute ID to the processingStepType - covered by #13
  2. Add two attributes, CORRECTEDBY and VERIFIEDBY, to all elements. Each attribute holds a list of references (via the ID attribute) to all processingStepType entries that have changed the original value.

Example:

<processingStep ID="ID005">
    <processingDateTime>2010-12-15T15:02:48</processingDateTime>
    <processingAgency>ACME Agency</processingAgency>
    <processingStepDescription>manual correction</processingStepDescription>
    <processingStepSettings>misc. settings</processingStepSettings>
    <processingSoftware>
        <softwareCreator>USAL</softwareCreator>
        <softwareName>Aletheia</softwareName>
        <softwareVersion>1.2.3</softwareVersion>
    </processingSoftware>
</processingStep>

<TextLine ID="ID069" STYLEREFS="ID007" BASELINE="1261" CORRECTEDBY="ID005" VPOS="1230" HPOS="260" HEIGHT="40" WIDTH="902">
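
Resolving such references amounts to a lookup from each ID listed in CORRECTEDBY or VERIFIEDBY to the processing step carrying that ID. A minimal Python sketch follows. It assumes the attributes hold whitespace-separated IDs in the manner of xsd:IDREFS, which the request does not spell out. The wrapper element doc, the second step ID006 and the function name resolve_steps are purely illustrative, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample; ID006 and ID999 are made up for illustration.
REF_SAMPLE = """
<doc>
  <processingStep ID="ID005">
    <processingStepDescription>manual correction</processingStepDescription>
  </processingStep>
  <processingStep ID="ID006">
    <processingStepDescription>verification</processingStepDescription>
  </processingStep>
  <TextLine ID="ID069" CORRECTEDBY="ID005" VERIFIEDBY="ID006"/>
  <TextLine ID="ID070" CORRECTEDBY="ID005 ID999"/>
</doc>
"""


def resolve_steps(root, attribute):
    """Return {element ID: [(step ID, description or None), ...]} for one attribute.

    A description of None marks a dangling reference (no step with that ID).
    """
    # Index all processing steps by their ID; a missing description becomes "".
    steps = {
        step.get("ID"): step.findtext("processingStepDescription", default="")
        for step in root.iter("processingStep")
    }
    resolved = {}
    for element in root.iter():
        refs = element.get(attribute)
        if refs is None:
            continue  # element does not carry the attribute
        resolved[element.get("ID")] = [(ref, steps.get(ref)) for ref in refs.split()]
    return resolved
```

For the sample, resolve_steps(root, "CORRECTEDBY") links ID069 to the manual correction step and flags ID999 on ID070 as unresolved.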

Justification:

"A lot of software tools and also human interactions are involved in different steps of the digitisation process. Each of them may affect an ALTO file by doing some refinements or corrections. From our point of view it would be desirable to keep track of the changes and verification done by the different agents which are involved in the digitisation process. This would allow a simple kind of a document history and also gives important information about the trustworthiness of the whole document. If for example everything was verified by a service provider, then we can assume that the quality of the document is very high. Storing the old values as well as the new ones would increase the file size tremendously. Therefore we suggest to store only the information about what has been changed and by whom without keeping track of the changed values."

Jo-CCS commented 8 years ago

A post-processing action such as a new layout analysis (as outlined in #36) will cause changes too large to track with this method, so the use case for such referencing might be quite limited in my view. And since the original text information is lost, if I were responsible for long-term preservation storage I would not allow the files to be overwritten and would keep a copy of them anyway. From the projects I did with national libraries, I have even heard that files in the repository may not be modified at all and that a new version is always deposited instead. So for me the question remains what additional information I gain from this and how I can use it.

On the other hand, it is a simple extension, is for optional use only, and does not cause a structural issue. I would just shorten the attribute names (CORR= / VERIFIED=), also to prevent data issues.

cneud commented 8 years ago

Continued in #39.