ocrd-fileformat-transform does not add an ALTO Processing tag

mikegerber commented 2 years ago

I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation added a <Processing> tag. I looked into the file to figure out whether https://github.com/kba/page-to-alto was used for the conversion, but found no processing tag for the conversion itself, only for segmentation/binarization/OCR.

mikegerber commented 2 years ago

(Alternatively, page-to-alto could add it, of course.)

kba commented 2 years ago

Can you provide an example of PAGE input and how you'd like to see it converted? page-to-alto should convert processing metadata, cf. https://github.com/kba/page-to-alto/blob/master/ocrd_page_to_alto/convert.py#L248-L265

mikegerber commented 2 years ago

Yes, it does convert the processing metadata correctly, but it does not add itself as a processing step, which would have been helpful as I was investigating whether page-to-alto was used for the conversion in ocrd-fileformat-transform. Here is an example; this was converted using ocrd-fileformat-transform:

    <Processing ID="ocrd-eynollah-segment-0">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"models": "/data/default", "dpi": "0", "full_layout": "True", "curved_line": "False", "allow_scaling": "False", "headers_off": "False"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-eynollah-segment</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-sbb-binarize-1">
      <processingStepDescription>preprocessing/optimization/binarization</processingStepDescription>
      <processingStepSettings>{"model": "/data/sbb_binarization/models", "operation_level": "page"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-sbb-binarize</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-tesserocr-recognize-2">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"model": "deu", "dpi": "0", "padding": "0", "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": "False", "overwrite_text": "True", "shrink_polygons": "False", "block_polygons": "False", "find_tables": "True", "sparse_text": "False", "raw_lines": "False", "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": "{}", "xpath_parameters": "{}", "xpath_model": "{}", "auto_model": "False", "oem": "DEFAULT"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-tesserocr-recognize</softwareName>
      </processingSoftware>
    </Processing>

Full PAGE + ALTO: example.zip

mikegerber commented 2 years ago

What I would expect is an additional processing step like this (entirely made up):

    <Processing ID="ocrd-fileformat-transform-3">
      <processingStepDescription>conversion</processingStepDescription>
      <processingStepSettings>{"backend": "page-to-alto"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-fileformat-transform</softwareName>
      </processingSoftware>
    </Processing>

I know this is extra work but it's very useful to answer the question of how a file was created exactly.
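
For illustration, appending such a step to an ALTO 4 file could look roughly like the Python sketch below. It uses only the standard library and assumes the ALTO 4 namespace `http://www.loc.gov/standards/alto/ns-v4#`; the function name `add_processing_step` and the "conversion" description are made up (like the XML above), and this is not how page-to-alto or ocrd-fileformat-transform currently work.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: the target file is ALTO 4, which allows <Processing> under <Description>.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def add_processing_step(alto_xml: str, software: str, settings: dict) -> str:
    """Append a <Processing> element to /alto/Description (hypothetical helper)."""
    # Serialize ALTO elements without a prefix, as in the original file.
    ET.register_namespace("", ALTO_NS)
    root = ET.fromstring(alto_xml)
    description = root.find(f"{{{ALTO_NS}}}Description")
    if description is None:
        raise ValueError("no <Description> element found")

    # Continue the numbering of existing steps, e.g. "...-2" is followed by "...-3".
    index = len(description.findall(f"{{{ALTO_NS}}}Processing"))
    step = ET.SubElement(
        description, f"{{{ALTO_NS}}}Processing", {"ID": f"{software}-{index}"}
    )
    ET.SubElement(step, f"{{{ALTO_NS}}}processingStepDescription").text = "conversion"
    ET.SubElement(step, f"{{{ALTO_NS}}}processingStepSettings").text = json.dumps(settings)
    software_el = ET.SubElement(step, f"{{{ALTO_NS}}}processingSoftware")
    ET.SubElement(software_el, f"{{{ALTO_NS}}}softwareName").text = software
    return ET.tostring(root, encoding="unicode")
```

Called as `add_processing_step(alto_text, "ocrd-fileformat-transform", {"backend": "page-to-alto"})` on a file that already has three processing steps, this would yield the ID `ocrd-fileformat-transform-3` used in the example above.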

kba commented 2 years ago

Gotcha, yes this makes sense, at least for the OCR-D processor interface.

mikegerber commented 2 years ago

I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

bertsky commented 2 years ago

> I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way. Then in ocrd-fileformat-transform, we can fully inform about the processor and its options. (While in other use cases, we might want to hide the conversion.)

bertsky commented 2 years ago

But then again, doing this from page-to-alto or ocr-fileformat/script/transform/page__alto is much easier than from ocrd_fileformat. In the latter case, one would have to

- check the target format
- in the case of PAGE-XML, add a /pc:PcGts/pc:Metadata/pc:MetadataItem (as in ocrd-olena-binarize)
- in the case of ALTO-XML >= 4, add a /alto/Description/Processing (as outlined above)
- in the case of ALTO-XML < 4, add a /alto/Description/OCRProcessing/postProcessingStep (in an analogous way)
- in the case of hOCR...?

These editing commands should be done by a true XML editor, like xmlstarlet. That would have to be added to the system dependencies.

Perhaps one should even offer a parameter to make this postprocessing/annotation optional.
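
To make the format check listed above concrete, here is a rough Python sketch (standard library only, not the xmlstarlet approach). It assumes the output is well-formed XML and that the ALTO version can be read from the namespace URI (`ns-v2#`, `ns-v3#`, `ns-v4#`); the function name `provenance_location` is hypothetical, and hOCR is deliberately left unresolved, as in the list.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional


def provenance_location(xml_text: str) -> Optional[str]:
    """Return where a provenance entry would go for this output format.

    Hypothetical helper: it only decides the location, it does not edit the file.
    """
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified tags as "{namespace}localname".
    match = re.fullmatch(r"\{([^}]*)\}(.*)", root.tag)
    namespace, local = match.groups() if match else ("", root.tag)

    if local == "PcGts":
        return "/pc:PcGts/pc:Metadata/pc:MetadataItem"
    if local == "alto":
        # Assumption: the version is encoded in the namespace, e.g. ".../alto/ns-v4#".
        version = re.search(r"ns-v(\d+)#", namespace)
        if version and int(version.group(1)) >= 4:
            return "/alto/Description/Processing"
        return "/alto/Description/OCRProcessing/postProcessingStep"
    # Anything else (e.g. hOCR) is an open question in this discussion.
    return None
```

A parameter that makes the annotation optional could then simply skip this lookup and leave the converted file untouched.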

mikegerber commented 2 years ago

> > I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

> I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way.

It does processing, so why should it not add processing info? I think it's not correct to omit it.

OCR-D / ocrd_fileformat

ocrd-fileformat-transform does not add an ALTO Processing tag #35