altoxml / schema

ALTO XML schema - latest and all former versions

Add Processing to replace OCRProcessing #13

Closed jukervin closed 8 years ago

jukervin commented 10 years ago

The current process-recording elements are tied to OCR and are also somewhat redundant. I think it would make sense to rename OCRProcessing to Processing and to replace preProcessingStep, ocrProcessingStep, and postProcessingStep with a generic processingStep that carries a processingStepType element recording the type of processing performed.

Currently:

<OCRProcessing ID="OCRPROCESSING_1">
  <preProcessingStep>
    <processingDateTime>2009-10-19</processingDateTime>
    <processingAgency>CCS Content Conversion Specialists GmbH,
    </processingAgency>
    <processingStepDescription>align</processingStepDescription>
    <processingStepSettings>CCS OCR Processing Filter</processingStepSettings>
    <processingSoftware>
      <softwareCreator>CCS Content Conversion Specialists GmbH,Germany</softwareCreator>
      <softwareName>CCS docWORKS</softwareName>
      <softwareVersion>6.3-0.91</softwareVersion>
      <applicationDescription/>
    </processingSoftware>
  </preProcessingStep>
  <ocrProcessingStep>
    <processingSoftware>
      <softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
      <softwareName>FineReader</softwareName>
      <softwareVersion>8.1</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

Suggestion:

<Processing>
  <!-- ID values are typed xsd:ID in the schema below, so they must not start with a digit. -->
  <!-- processingStepType comes first to match the xsd:sequence order in the schema below. -->
  <ProcessingStep ID="PROCESSINGSTEP_1">
    <processingStepType>image processing</processingStepType>
    <processingDateTime>2009-10-19T10:10:10+05:00</processingDateTime>
    <processingAgency>ACME Processing</processingAgency>
    <processingStepDescription>align</processingStepDescription>
    <processingStepSettings>ACME OCR Processing Filter</processingStepSettings>
    <processingSoftware>
      <softwareCreator>CCS Content Conversion Specialists GmbH, Germany</softwareCreator>
      <softwareName>CCS docWORKS</softwareName>
      <softwareVersion>6.3-0.91</softwareVersion>
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
  <ProcessingStep ID="PROCESSINGSTEP_2">
    <processingStepType>OCR</processingStepType>
    <processingDateTime>2009-10-19T10:21:14+05:00</processingDateTime>
    <processingAgency>CCS Content Conversion Specialists GmbH, www.content-conversion.com</processingAgency>
    <processingStepDescription></processingStepDescription>
    <processingStepSettings></processingStepSettings>
    <processingSoftware>
      <softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
      <softwareName>FineReader</softwareName>
      <softwareVersion>8.1</softwareVersion>
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
  <ProcessingStep ID="PROCESSINGSTEP_3">
    <processingStepType>Proofreading</processingStepType>
    <processingDateTime>2009-10-19T15:28:30+05:00</processingDateTime>
    <processingAgency>ACME Corp.</processingAgency>
    <processingStepDescription></processingStepDescription>
    <processingStepSettings></processingStepSettings>
    <processingSoftware>
      <softwareCreator>ACME</softwareCreator>
      <softwareName>Proofreader</softwareName>
      <softwareVersion>9.9</softwareVersion>
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
</Processing>

Schema changes:

<xsd:element name="OCRProcessing" minOccurs="0" maxOccurs="unbounded">
+  <xsd:annotation>
+    <xsd:documentation>DEPRECATED: the Processing element should be used instead.</xsd:documentation>
+  </xsd:annotation>
  <xsd:complexType>
    <xsd:complexContent>
      <xsd:extension base="ocrProcessingType">
        <xsd:attribute name="ID" type="xsd:ID" use="required"/>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:element>

+<xsd:element name="Processing" minOccurs="0" maxOccurs="unbounded">
+  <xsd:complexType>
+    <xsd:complexContent>
+      <xsd:extension base="ProcessingStepType">
+        <xsd:attribute name="ID" type="xsd:ID" use="required"/>
+      </xsd:extension>
+    </xsd:complexContent>
+  </xsd:complexType>
+</xsd:element>

<xsd:complexType name="ProcessingStepType">
<xsd:annotation> 
  <xsd:documentation>A processing step.</xsd:documentation>
</xsd:annotation>
 <xsd:sequence>

+  <xsd:element name="processingStepType" type="xsd:string" minOccurs="0"> 
+   <xsd:annotation>
+    <xsd:documentation>Type of processing step</xsd:documentation>
+   </xsd:annotation>
+  </xsd:element>

  <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0"> 
   <xsd:annotation>
    <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
   <xsd:annotation>
    <xsd:documentation>Identifies the organization-level producer(s) of the
      processed image.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingStepDescription" type="xsd:string" minOccurs="0" maxOccurs="unbounded">
   <xsd:annotation>
    <xsd:documentation>An ordinal listing of the image processing steps performed.
        For example, "image despeckling."</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
   <xsd:annotation>
    <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType> 
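
The instance example under "Suggestion" nests ProcessingStep children inside Processing, whereas the declaration above extends ProcessingStepType on Processing itself. The following is a minimal sketch of one way to reconcile the two, assuming a wrapper element is what is wanted. The nesting, the anonymous types, and the minOccurs/maxOccurs values are assumptions for illustration and not part of the proposal:

```xml
<!-- Sketch only: Processing becomes a container, and each ProcessingStep
     child reuses ProcessingStepType and carries the required ID. -->
<xsd:element name="Processing" minOccurs="0">
  <xsd:complexType>
    <xsd:sequence>
      <!-- One child per processing step, listed in the order the steps ran. -->
      <xsd:element name="ProcessingStep" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:complexContent>
            <xsd:extension base="ProcessingStepType">
              <xsd:attribute name="ID" type="xsd:ID" use="required"/>
            </xsd:extension>
          </xsd:complexContent>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
```

With this shape the ID sits on each step rather than on the container, which is what a per-step reference would need.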

cneud commented 8 years ago

This seems very sensible to me!

Having a generic Processing element and an ID attribute on each processingStep would, it seems to me, also satisfy much of what has been requested in #35. What is still missing, though, is a way to track exactly which elements have been produced or altered by a particular processingStep.
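
One possible shape for such tracking is sketched below. It assumes a hypothetical PROCESSINGREFS attribute of type xsd:IDREFS on content elements, pointing back at step IDs. The attribute name and the ID values are illustrative only and are not defined anywhere in this proposal. Because the step IDs are typed xsd:ID, they must be valid NCNames and therefore cannot begin with a digit:

```xml
<!-- Sketch only. PROCESSINGREFS is an assumed attribute name, not part of the
     proposal in this issue; layout attributes on TextLine and String are
     omitted for brevity. -->

<!-- Schema side: an optional, space-separated list of references to step IDs. -->
<xsd:attribute name="PROCESSINGREFS" type="xsd:IDREFS" use="optional"/>

<!-- Instance side: each content element names the steps that produced or altered it. -->
<TextLine ID="LINE_1" PROCESSINGREFS="PROCESSINGSTEP_2">
  <String ID="STRING_1" CONTENT="Example" PROCESSINGREFS="PROCESSINGSTEP_2 PROCESSINGSTEP_3"/>
</TextLine>
```

A validating parser would then reject any reference to a step ID that does not exist in the same file, though this does nothing for elements that a later step has removed.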

Jo-CCS commented 8 years ago

Tracking changes to elements will be impossible to cover within a single XML file. XML is hierarchically structured, and (post-)processing actions can also change the hierarchy, which cannot be recorded. Elements might also be removed, after which they can no longer be referenced. In such cases it makes much more sense to clone the files and simply add history records, so that it is known which file has which status and the versions can be compared. Storage management systems do the rest: they avoid fully redundant data by saving only the changes, while keeping the ability to roll back to a former version.

cneud commented 8 years ago

Continued in #39.