Open mikegerber opened 2 years ago
(Alternatively, page-to-alto could add it, of course.)
Can you provide an example of PAGE input and how you'd like to see it converted? page-to-alto should convert processing metadata, cf. https://github.com/kba/page-to-alto/blob/master/ocrd_page_to_alto/convert.py#L248-L265
Yes, it does convert the processing metadata correctly, but it does not add itself as a processing step. That would have been helpful while I was investigating whether page-to-alto was used for the conversion via ocrd-fileformat-transform. Here is an example, converted using ocrd-fileformat-transform:
<Processing ID="ocrd-eynollah-segment-0">
<processingStepDescription>layout/segmentation/region</processingStepDescription>
<processingStepSettings>{"models": "/data/default", "dpi": "0", "full_layout": "True", "curved_line": "False", "allow_scaling": "False", "headers_off": "False"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-eynollah-segment</softwareName>
</processingSoftware>
</Processing>
<Processing ID="ocrd-sbb-binarize-1">
<processingStepDescription>preprocessing/optimization/binarization</processingStepDescription>
<processingStepSettings>{"model": "/data/sbb_binarization/models", "operation_level": "page"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-sbb-binarize</softwareName>
</processingSoftware>
</Processing>
<Processing ID="ocrd-tesserocr-recognize-2">
<processingStepDescription>layout/segmentation/region</processingStepDescription>
<processingStepSettings>{"model": "deu", "dpi": "0", "padding": "0", "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": "False", "overwrite_text": "True", "shrink_polygons": "False", "block_polygons": "False", "find_tables": "True", "sparse_text": "False", "raw_lines": "False", "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": "{}", "xpath_parameters": "{}", "xpath_model": "{}", "auto_model": "False", "oem": "DEFAULT"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-tesserocr-recognize</softwareName>
</processingSoftware>
</Processing>
Full PAGE + ALTO: example.zip
What I would expect is an additional processing step like this (entirely made up):
<Processing ID="ocrd-fileformat-transform-3">
<processingStepDescription>conversion</processingStepDescription>
<processingStepSettings>{"backend": "page-to-alto"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-fileformat-transform</softwareName>
</processingSoftware>
</Processing>
I know this is extra work but it's very useful to answer the question of how a file was created exactly.
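To illustrate why such an entry helps, here is a minimal sketch of how one might list the processing steps recorded in an ALTO file. It uses only the Python standard library; the function name `list_processing_steps` is made up for this example, and it assumes the `Processing` elements look like the ones shown above.

```python
import xml.etree.ElementTree as ET


def list_processing_steps(alto_xml):
    """Return (ID, softwareName) pairs for every Processing element.

    Hypothetical helper for illustration, not part of page-to-alto
    or ocrd_fileformat.
    """
    root = ET.fromstring(alto_xml)
    # Reuse whatever namespace the root element carries ("" if none),
    # so the sketch works for namespaced files and bare fragments alike.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    steps = []
    for proc in root.iter(f"{ns}Processing"):
        name = proc.findtext(f"{ns}processingSoftware/{ns}softwareName")
        steps.append((proc.get("ID"), name))
    return steps
```

With a step like the proposed `ocrd-fileformat-transform-3` present, checking whether the conversion went through ocrd-fileformat-transform becomes a simple lookup in the returned list.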
Gotcha, yes this makes sense, at least for the OCR-D processor interface.
I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.
I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way. Then in ocrd-fileformat-transform, we can fully inform about the processor and its options. (While in other use cases, we might want to hide the conversion.)
But then again, doing this from page-to-alto or ocr-fileformat/script/transform/page__alto is much easier than from ocrd_fileformat. In the latter case, one would have to add
- /pc:PcGts/pc:Metadata/pc:MetadataItem (as in ocrd-olena-binarize)
- /alto/Description/Processing (as outlined above)
- /alto/Description/OCRProcessing/postProcessingStep (in an analogous way)

These editing commands should be done by a true XML editor, like xmlstarlet. That would have to be added to the system dependencies.
Perhaps one should even offer a parameter to make this postprocessing/annotation optional.
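For comparison, the /alto/Description/Processing edit could be sketched in Python instead of xmlstarlet. This is only an illustration under assumptions: the function name `add_processing_step`, the `annotate` switch, and the ID scheme (software name plus the count of existing steps) are made up here, and the new element is simply appended at the end of `Description`.

```python
import json
import xml.etree.ElementTree as ET


def add_processing_step(alto_xml, software, description, settings, annotate=True):
    """Append a Processing element to /alto/Description.

    Hypothetical helper for illustration only. `annotate=False` leaves
    the input unchanged, mirroring the optional parameter suggested above.
    """
    if not annotate:
        return alto_xml
    root = ET.fromstring(alto_xml)
    # Reuse the namespace of the root element ("" if the file has none).
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    if ns:
        # Serialize with a default namespace instead of ns0: prefixes.
        # Note: this changes module-wide ElementTree state.
        ET.register_namespace("", ns[1:-1])
    desc = root.find(f"{ns}Description")
    if desc is None:
        raise ValueError("no Description element found")
    # Assumed ID scheme: <software>-<number of steps already present>.
    index = len(desc.findall(f"{ns}Processing"))
    proc = ET.SubElement(desc, f"{ns}Processing", ID=f"{software}-{index}")
    ET.SubElement(proc, f"{ns}processingStepDescription").text = description
    ET.SubElement(proc, f"{ns}processingStepSettings").text = json.dumps(settings)
    sw = ET.SubElement(proc, f"{ns}processingSoftware")
    ET.SubElement(sw, f"{ns}softwareName").text = software
    return ET.tostring(root, encoding="unicode")
```

Called as `add_processing_step(xml, "ocrd-fileformat-transform", "conversion", {"backend": "page-to-alto"})` on a file with three existing steps, this would produce an entry shaped like the made-up example earlier in the thread. The PAGE and older ALTO variants would need analogous but separate handling.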
I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.
I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way.
It does processing, so why should it not add processing info? I think it's not correct to omit it.
I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation would add a <Processing> tag. I looked into the file to figure out if https://github.com/kba/page-to-alto was used for the conversion and did not find a processing tag for the conversion, just for segmentation/binarization/OCR.