OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces #544

Open kba opened 4 years ago

kba commented 4 years ago

METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH, the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.

Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.

Many users have therefore developed scripts to preprocess the input and postprocess the output of OCR-D.

OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:

These are just some ideas; we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool, so we can develop a solution for common tasks together.

kba commented 4 years ago

@mikegerber

here is my collection of METS/PAGE file fixer scripts, as mentioned in the call: https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers - not to be used lightly, no warranty, you have been warned 🚧 🚨 🚧

mikegerber commented 4 years ago

I don't know if I missed the point a bit, but I do see two different groups of use cases here:

  1. Sanitizing/Repairing/maintaining invalid or outdated METS/workspaces:

  2. Other post-processing

    • Pruning of mets:fileGrp, either by allowlist or denylist, i.e. remove mets:fileGrp elements and the mets:file entries they contain (and the files on disk) that are no longer required
    • Removing all but the lowest level of page:TextEquiv information in PAGE-XML
    • Approximating polygons with bounding boxes in PAGE-XML to support full-text indexing (see the sketch below)

Should these use case groups maybe be put into two separate processors/tools?
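
A minimal sketch of the bounding-box idea from the list above, using plain lxml; the PAGE namespace URI and integer coordinates are assumptions, and this is not OCR-D core API, just an illustration of the operation:

import lxml.etree as ET

# assumed PAGE namespace; must match the PAGE version of the actual documents
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'

def polygons_to_bboxes(page_file, out_file):
    # replace every Coords/@points polygon with its axis-aligned bounding box
    tree = ET.parse(page_file)
    for coords in tree.getroot().iter('{%s}Coords' % PAGE_NS):
        points = coords.get('points', '')
        if not points:
            continue
        # points are "x1,y1 x2,y2 ..." pairs (assumed to be integers)
        xy = [tuple(map(int, p.split(','))) for p in points.split()]
        xs = [x for x, _ in xy]
        ys = [y for _, y in xy]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        coords.set('points', '%d,%d %d,%d %d,%d %d,%d' % (x0, y0, x1, y0, x1, y1, x0, y1))
    tree.write(out_file, xml_declaration=True, encoding='UTF-8')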

kba commented 4 years ago

Should these use case groups maybe be put into two separate processors/tools?

Yes, probably. Or even task-specific processors (ocrd-sanitize-prune-filegroups, ocrd-sanitize-textequiv ...)

kba commented 4 years ago

Of interest in this context: https://github.com/tboenig/AletheiaTools

kba commented 4 years ago

Another useful operation: Assign pcGtsId from the mets:file/@ID

mikegerber commented 4 years ago

Another useful operation: Assign pcGtsId from the mets:file/@ID

https://github.com/mikegerber/sbb-useful-hacks/blob/master/mets-fixers/fix-page-pcgtsid-to-be-mets-file-id
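
For illustration, a minimal sketch of that operation with plain lxml; the file group name 'OCR-D-OCR' and the assumption that FLocat hrefs are local paths are made up, the linked script is the actual reference:

import lxml.etree as ET

NS = {'mets': 'http://www.loc.gov/METS/',
      'xlink': 'http://www.w3.org/1999/xlink'}

def fix_pcgtsid(mets_file, filegrp='OCR-D-OCR'):
    # set PcGts/@pcGtsId in each PAGE-XML file to the @ID of its mets:file
    mets = ET.parse(mets_file)
    for mets_file_el in mets.findall('.//mets:fileGrp[@USE="%s"]/mets:file' % filegrp, NS):
        href = mets_file_el.find('mets:FLocat', NS).get('{http://www.w3.org/1999/xlink}href')
        page = ET.parse(href)  # assumes href is a local file path
        page.getroot().set('pcGtsId', mets_file_el.get('ID'))
        page.write(href, xml_declaration=True, encoding='UTF-8')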

M3ssman commented 4 years ago

Something related: extract METS/MODS from an xml_doc created from an OAI response, like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)
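
For context, a self-contained sketch of the above; XMLNS (mapping the mets prefix) and reading the OAI response from a file with lxml are assumptions:

import lxml.etree as ET

XMLNS = {'mets': 'http://www.loc.gov/METS/'}

def extract_mets(oai_response_file):
    # pull the mets:mets subtree out of an OAI-PMH GetRecord response
    xml_root = ET.parse(oai_response_file).getroot()
    mets_root_el = xml_root.find('.//mets:mets', XMLNS)
    if mets_root_el is not None:
        return ET.ElementTree(mets_root_el)
    return None
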
kba commented 4 years ago

Something related: extract METS/MODS from an xml_doc created from an OAI response, like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)

Let's keep OAI-PMH in a separate issue, cf. https://github.com/OCR-D/core/issues/539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see https://github.com/OCR-D/core/pull/453#issuecomment-595757940

M3ssman commented 4 years ago

Snippet for clearing METS/MODS fileGrp entries, using a whitelist/blacklist approach:

XMLNS = {'mets': 'http://www.loc.gov/METS/'}

def clear_fileGroups(xml_root, black_list=None, white_list=None):
    # drop mets:fileGrp elements either by denylist (black_list) or allowlist (white_list)
    file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
    if not file_sections:
        raise ValueError('invalid METS: no mets:fileSec found')

    for file_section in file_sections:
        for sub_group in list(file_section):
            subgroup_label = sub_group.attrib['USE']
            if black_list and subgroup_label in black_list:
                file_section.remove(sub_group)
                sanitize_physical_structMap(xml_root, subgroup_label)
            elif white_list and subgroup_label not in white_list:
                file_section.remove(sub_group)
                sanitize_physical_structMap(xml_root, subgroup_label)

def sanitize_physical_structMap(xml_root, file_ref):
    # remove mets:fptr entries of the physical structMap that reference the dropped fileGrp
    pages = xml_root.findall(
        './/mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)

    for page in pages:
        # NB: substring match, so e.g. "OCR-D-IMG" also hits "OCR-D-IMG-BIN" FILEIDs
        removals = [fptr for fptr in page if file_ref in fptr.attrib['FILEID']]
        for removal in removals:
            page.remove(removal)
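
A hypothetical invocation of the snippet above (the METS file name and the fileGrp names are made up):

import lxml.etree as ET

# keep only an allowlist of fileGrps, drop everything else (including their structMap references)
tree = ET.parse('mets.xml')
clear_fileGroups(tree.getroot(), white_list=['DEFAULT', 'THUMBS', 'FULLTEXT'])
tree.write('mets.clean.xml', xml_declaration=True, encoding='UTF-8')
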
M3ssman commented 4 years ago

Also convenient: re-indexing all METS file groups after any undesired reference entries have been dropped.

bertsky commented 3 years ago

My biggest need for a sanitizer would be ensuring that ingest into Kitodo.Presentation / DFG-Viewer works.

According to this we are already close, but...

bertsky commented 3 years ago

I stand corrected: As this example by @stefanCCS – METS and ALTO – shows, MIMETYPE="application/alto+xml" and ALTO v4.1 actually do work. (That is, newer features are simply ignored.)