OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces #544

Open kba opened 4 years ago

kba commented 4 years ago

METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH, the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.

Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.

Many users have therefore developed scripts to preprocess the input and postprocess the output of OCR-D.

OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:

These are just some ideas; we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool, so we can develop a solution for common tasks together.

kba commented 4 years ago

@mikegerber

here is my collection of METS/PAGE file fixer scripts, as mentioned in the call: https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers - not to be used lightly, no warranty, you have been warned 🚧 🚨 🚧

mikegerber commented 4 years ago

I don't know if I missed the point a bit, but I do see two different groups of use cases here:

  1. Sanitizing/Repairing/maintaining invalid or outdated METS/workspaces:

  2. Other post-processing

    • Pruning of mets:fileGrp, either by allowlist or denylist, i.e. remove mets:fileGrp elements and the mets:file entries they contain (and the files on disk) that are no longer required
    • Removing all but the lowest level of page:TextEquiv information in PAGE-XML
    • Approximating polygons with bounding boxes in PAGE-XML to support full-text indexing (see the sketch below)

Should these use case groups maybe be put into two separate processors/tools?
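
A minimal sketch of the bounding-box idea from the list above, using plain lxml; the PAGE namespace URI and integer coordinates are assumptions, and this is not OCR-D core API, just an illustration of the operation:

import lxml.etree as ET

# assumed PAGE namespace; must match the PAGE version of the actual documents
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'

def polygons_to_bboxes(page_file, out_file):
    # replace every Coords/@points polygon with its axis-aligned bounding box
    tree = ET.parse(page_file)
    for coords in tree.getroot().iter('{%s}Coords' % PAGE_NS):
        points = coords.get('points', '')
        if not points:
            continue
        # points are "x1,y1 x2,y2 ..." pairs (assumed to be integers)
        xy = [tuple(map(int, p.split(','))) for p in points.split()]
        xs = [x for x, _ in xy]
        ys = [y for _, y in xy]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        coords.set('points', '%d,%d %d,%d %d,%d %d,%d' % (x0, y0, x1, y0, x1, y1, x0, y1))
    tree.write(out_file, xml_declaration=True, encoding='UTF-8')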

kba commented 4 years ago

Should these use case groups maybe be put into two separate processors/tools?

Yes, probably. Or even task-specific processors (ocrd-sanitize-prune-filegroups, ocrd-sanitize-textequiv ...)

kba commented 4 years ago

Of interest in this context: https://github.com/tboenig/AletheiaTools

kba commented 4 years ago

Another useful operation: Assign pcGtsId from the mets:file/@ID

mikegerber commented 4 years ago

Another useful operation: Assign pcGtsId from the mets:file/@ID

https://github.com/mikegerber/sbb-useful-hacks/blob/master/mets-fixers/fix-page-pcgtsid-to-be-mets-file-id
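
For illustration, a minimal sketch of that operation with plain lxml; the file group name 'OCR-D-OCR' and the assumption that FLocat hrefs are local paths are made up, the linked script is the actual reference:

import lxml.etree as ET

NS = {'mets': 'http://www.loc.gov/METS/',
      'xlink': 'http://www.w3.org/1999/xlink'}

def fix_pcgtsid(mets_file, filegrp='OCR-D-OCR'):
    # set PcGts/@pcGtsId in each PAGE-XML file to the @ID of its mets:file
    mets = ET.parse(mets_file)
    for mets_file_el in mets.findall('.//mets:fileGrp[@USE="%s"]/mets:file' % filegrp, NS):
        href = mets_file_el.find('mets:FLocat', NS).get('{http://www.w3.org/1999/xlink}href')
        page = ET.parse(href)  # assumes href is a local file path
        page.getroot().set('pcGtsId', mets_file_el.get('ID'))
        page.write(href, xml_declaration=True, encoding='UTF-8')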

M3ssman commented 4 years ago

Something related: extract METS/MODS from an xml_doc created from an OAI response, like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)
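
For context, a self-contained sketch of the above; XMLNS (mapping the mets prefix) and reading the OAI response from a file with lxml are assumptions:

import lxml.etree as ET

XMLNS = {'mets': 'http://www.loc.gov/METS/'}

def extract_mets(oai_response_file):
    # pull the mets:mets subtree out of an OAI-PMH GetRecord response
    xml_root = ET.parse(oai_response_file).getroot()
    mets_root_el = xml_root.find('.//mets:mets', XMLNS)
    if mets_root_el is not None:
        return ET.ElementTree(mets_root_el)
    return None
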
kba commented 4 years ago

Something related: extract METS/MODS from an xml_doc created from an OAI response, like this:

mets_root_el = xml_root.find('.//mets:mets', XMLNS)
if mets_root_el is not None:
    return ET.ElementTree(mets_root_el)

Let's keep OAI-PMH in a separate issue, cf. https://github.com/OCR-D/core/issues/539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see https://github.com/OCR-D/core/pull/453#issuecomment-595757940

M3ssman commented 4 years ago

Snippet for clearing METS/MODS fileGrp entries, using a whitelist/blacklist approach:

XMLNS = {'mets': 'http://www.loc.gov/METS/'}

def clear_fileGroups(xml_root, black_list=None, white_list=None):
    # drop mets:fileGrp elements either by denylist (black_list) or allowlist (white_list)
    file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
    if not file_sections:
        raise ValueError('invalid METS: no mets:fileSec found')

    for file_section in file_sections:
        for sub_group in list(file_section):
            subgroup_label = sub_group.attrib['USE']
            if black_list and subgroup_label in black_list:
                file_section.remove(sub_group)
                sanitize_physical_structMap(xml_root, subgroup_label)
            elif white_list and subgroup_label not in white_list:
                file_section.remove(sub_group)
                sanitize_physical_structMap(xml_root, subgroup_label)

def sanitize_physical_structMap(xml_root, file_ref):
    # remove mets:fptr entries of the physical structMap that reference the dropped fileGrp
    pages = xml_root.findall(
        './/mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)

    for page in pages:
        # NB: substring match, so e.g. "OCR-D-IMG" also hits "OCR-D-IMG-BIN" FILEIDs
        removals = [fptr for fptr in page if file_ref in fptr.attrib['FILEID']]
        for removal in removals:
            page.remove(removal)
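
A hypothetical invocation of the snippet above (the METS file name and the fileGrp names are made up):

import lxml.etree as ET

# keep only an allowlist of fileGrps, drop everything else (including their structMap references)
tree = ET.parse('mets.xml')
clear_fileGroups(tree.getroot(), white_list=['DEFAULT', 'THUMBS', 'FULLTEXT'])
tree.write('mets.clean.xml', xml_declaration=True, encoding='UTF-8')
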
M3ssman commented 4 years ago

Also convenient: re-indexing all METS file groups after any undesired reference entries have been dropped.

bertsky commented 3 years ago

My biggest need for a sanitizer would be ensuring that ingest into Kitodo.Presentation / DFG-Viewer works.

According to this we are already close, but...

bertsky commented 3 years ago

I stand corrected: As this example by @stefanCCS – METS and ALTO – shows, MIMETYPE="application/alto+xml" and ALTO v4.1 actually do work. (That is, newer features are simply ignored.)