Open kba opened 4 years ago
here is my collection of METS/PAGE file fixer scripts, as mentioned in the call: https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers - not to be used lightly, no warranty, you have been warned 🚧 🚨 🚧
I don't know if I missed the point a bit, but I do see two different groups of use cases here:
- Sanitizing/repairing/maintaining invalid or outdated METS/workspaces
- Other post-processing
Should these use case groups maybe be put into two separate processors/tools?
Yes, probably. Or even task-specific processors (ocrd-sanitize-prune-filegroups, ocrd-sanitize-textequiv, ...)
Of interest in this context: https://github.com/tboenig/AletheiaTools
Another useful operation: Assign pcGtsId from the mets:file/@ID
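A minimal sketch of that operation on a single PAGE document, using only the standard library. The function name assign_pcgts_id, the sample file ID and the namespace version in the sample are assumptions for illustration; looking up the mets:file/@ID that belongs to each PAGE file is left to the caller.

```python
import xml.etree.ElementTree as ET

# PAGE namespace used for the sample document below (an assumption; other versions exist)
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'


def assign_pcgts_id(page_root, file_id):
    """Set pcGtsId on the PcGts root element to the given mets:file ID."""
    # Check only the local name, so any PAGE namespace version is accepted
    if not page_root.tag.endswith('}PcGts'):
        raise ValueError('not a PAGE-XML root element: %s' % page_root.tag)
    page_root.set('pcGtsId', file_id)
    return page_root


# Hypothetical input: a bare PAGE document and a made-up mets:file ID
page = ET.fromstring('<PcGts xmlns="%s"><Page/></PcGts>' % PAGE_NS)
assign_pcgts_id(page, 'OCR-D-GT-PAGE_0001')
```

Afterwards page.get('pcGtsId') returns the assigned ID, and the tree can be written back with ET.ElementTree(page).write(...).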
Something related: extract METS/MODS from xml_doc created from OAI-Response like this:
    mets_root_el = xml_root.find('.//mets:mets', XMLNS)
    if mets_root_el is not None:
        return ET.ElementTree(mets_root_el)
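For readers who want to try this in isolation, here is a self-contained sketch. It assumes ET is xml.etree.ElementTree and that XMLNS maps the mets prefix to the METS namespace; the wrapper name extract_mets and the sample response are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed prefix mapping; the original XMLNS may contain more entries
XMLNS = {'mets': 'http://www.loc.gov/METS/'}


def extract_mets(xml_root):
    """Return the embedded mets:mets element as its own tree, or None if there is none."""
    mets_root_el = xml_root.find('.//mets:mets', XMLNS)
    if mets_root_el is not None:
        return ET.ElementTree(mets_root_el)
    return None


# Hypothetical, heavily shortened OAI-PMH GetRecord response
OAI_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<GetRecord><record><metadata>'
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mets:fileSec/></mets:mets>'
    '</metadata></record></GetRecord>'
    '</OAI-PMH>'
)
mets_tree = extract_mets(ET.fromstring(OAI_RESPONSE))
```

Note that './/mets:mets' only searches descendants, so a document whose root already is mets:mets yields None here.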
Let's keep OAI-PMH in a separate issue, cf. https://github.com/OCR-D/core/issues/539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see https://github.com/OCR-D/core/pull/453#issuecomment-595757940
Snippet for pruning METS/MODS fileGrp entries, using a whitelist/blacklist approach:
    # XMLNS is assumed to map the 'mets' prefix to 'http://www.loc.gov/METS/'
    def clear_fileGroups(xml_root, black_list=None, white_list=None):
        """Drop mets:fileGrp elements by their USE, via blacklist and/or whitelist."""
        file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
        if not file_sections:
            raise Exception('invalid xml data !')
        for file_section in file_sections:
            # iterate over a copy, since groups are removed while looping
            for sub_group in list(file_section):
                subgroup_label = sub_group.attrib['USE']
                blacklisted = bool(black_list) and subgroup_label in black_list
                not_whitelisted = bool(white_list) and subgroup_label not in white_list
                # remove each group at most once, even if both lists apply
                if blacklisted or not_whitelisted:
                    file_section.remove(sub_group)
                    sanitze_pysical_strctMap(xml_root, subgroup_label)


    def sanitze_pysical_strctMap(xml_root, file_ref):
        """Remove mets:fptr entries of the physical structMap that refer to a dropped group."""
        pages = xml_root.findall('.//mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)
        for page in pages:
            # NOTE: substring match, relies on file IDs containing the fileGrp USE label
            removals = [fptr for fptr in page if file_ref in fptr.attrib['FILEID']]
            for removal in removals:
                page.remove(removal)
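The snippet above matches mets:fptr entries by checking whether the fileGrp label occurs in the FILEID, which only works if file IDs embed that label. As a hedged alternative, the following sketch collects the actual mets:file IDs of the removed groups and drops exactly the references to those. The function name prune_file_groups, the allowlist-only interface and the sample document are assumptions, not part of the original script.

```python
import xml.etree.ElementTree as ET

METS = 'http://www.loc.gov/METS/'
XMLNS = {'mets': METS}


def prune_file_groups(xml_root, keep):
    """Remove every mets:fileGrp whose USE is not in `keep`, plus all fptr entries pointing at its files."""
    removed_ids = set()
    for file_sec in xml_root.findall('.//mets:fileSec', XMLNS):
        # copy the child list, because groups are removed while looping
        for grp in list(file_sec):
            if grp.get('USE') not in keep:
                removed_ids.update(f.get('ID') for f in grp.iter('{%s}file' % METS))
                file_sec.remove(grp)
    # materialize the div list first, so the tree is not modified during traversal
    for div in list(xml_root.iter('{%s}div' % METS)):
        for child in list(div):
            if child.tag == '{%s}fptr' % METS and child.get('FILEID') in removed_ids:
                div.remove(child)
    return removed_ids


# Hypothetical two-group workspace with a single page
SAMPLE = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    '<mets:fileSec>'
    '<mets:fileGrp USE="DEFAULT"><mets:file ID="DEFAULT_0001"/></mets:fileGrp>'
    '<mets:fileGrp USE="OCR-D-BIN"><mets:file ID="BIN_0001"/></mets:fileGrp>'
    '</mets:fileSec>'
    '<mets:structMap TYPE="PHYSICAL"><mets:div TYPE="physSequence">'
    '<mets:div TYPE="page" ID="P1">'
    '<mets:fptr FILEID="DEFAULT_0001"/><mets:fptr FILEID="BIN_0001"/>'
    '</mets:div></mets:div></mets:structMap>'
    '</mets:mets>'
)
removed = prune_file_groups(SAMPLE, keep={'DEFAULT'})
```

Exact ID matching costs one extra pass over the structMap but does not depend on any naming convention for file IDs.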
Also convenient: re-index all METS-Filegroups after any undesired reference entries were dropped.
My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works.
According to this we are already close, but...
- /alto/Layout/Page/@WIDTH is extremely important, because Kitodo.Presentation needs to add the DFG footer (which comes in multiples of 1000px width IIUC) and therefore scales the images and thus needs to know by what amount to scale the ALTO coordinates accordingly
- DEFAULT fileGrp (whether by alias to another, existing fileGrp or by renaming I am not sure)
- FULLTEXT fileGrp (not sure what to do if multiple versions are available) and MIMETYPE="text/html" (not application/alto+xml!)
- LOCTYPE="URL" (but not sure about the kind of response the webserver needs to give, esp. whether it must understand and convey the correct Content-Type MIME or may omit it or use some nonsense like application/octet-stream)
- mets:file: there must be exactly one FLocat (which was already discussed within the remote-local bookkeeping and partial manifestation idea)
- structMap of TYPE="PHYSICAL" with a mets:div of TYPE="physSequence" in it and at least one mets:div in that with TYPE="page" (i.e. at least one page) and an ORDER label
- structMap of TYPE="LOGICAL" with a mets:div of some TYPE in it ("the name is not important") and at least one mets:div in that with TYPE among these labels
- structLink linking each physical page to at least one logical element
- mets:dmdSec with at least some MODS or TEIHDR metadata
- mets:amdSec with at least some mets:techMD or external namespace metadata and some mets:rightsMD (with various dv:rights specs) and mets:digiprovMD (with dv:reference)

I stand corrected: As this example by @stefanCCS (METS and ALTO) shows, MIMETYPE="application/alto+xml" and ALTO v4.1 do work actually. (That is, newer features are simply ignored.)
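Some of the structural requirements listed above can be checked mechanically. The following is a rough sketch under stated assumptions, not a validator for Kitodo.Presentation: the function name check_for_presentation, the message texts and the sample document are made up, it expects the mets:mets element as root, and it covers neither the MIME type and LOCTYPE questions nor the ALTO WIDTH attribute.

```python
import xml.etree.ElementTree as ET

NS = {'mets': 'http://www.loc.gov/METS/'}


def check_for_presentation(root):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # required file groups
    uses = {grp.get('USE') for grp in root.findall('mets:fileSec/mets:fileGrp', NS)}
    for required in ('DEFAULT', 'FULLTEXT'):
        if required not in uses:
            problems.append('no fileGrp with USE="%s"' % required)
    # exactly one FLocat per mets:file
    for file_el in root.iter('{%s}file' % NS['mets']):
        count = len(file_el.findall('mets:FLocat', NS))
        if count != 1:
            problems.append('mets:file %s has %d FLocat elements' % (file_el.get('ID'), count))
    # physical structMap with at least one page, each page carrying ORDER
    pages = root.findall(
        'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]', NS)
    if not pages:
        problems.append('no page in the physical structMap')
    problems.extend('page %s lacks ORDER' % p.get('ID') for p in pages if p.get('ORDER') is None)
    # logical structMap and the remaining top-level sections
    if root.find('mets:structMap[@TYPE="LOGICAL"]/mets:div', NS) is None:
        problems.append('no logical structMap with a mets:div')
    for section in ('structLink', 'dmdSec', 'amdSec'):
        if root.find('mets:%s' % section, NS) is None:
            problems.append('no mets:%s' % section)
    return problems


# Hypothetical, deliberately incomplete METS document
SAMPLE = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    '<mets:dmdSec ID="d1"/>'
    '<mets:fileSec><mets:fileGrp USE="DEFAULT"><mets:file ID="f1"/></mets:fileGrp></mets:fileSec>'
    '</mets:mets>'
)
problems = check_for_presentation(SAMPLE)
```

Returning a list of findings instead of raising on the first one makes it possible to report everything that needs repair in a single pass.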
METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.
Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.
Many users therefore have developed scripts to preprocess input and postprocess output of OCR-D.
OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:
- mets:fileGrp, either by allowlist or denylist, i.e. remove mets:fileGrp and the contained mets:file (and files on disk) that are not required anymore
- xlink:href to match local conventions
- page:TextEquiv information in PAGE-XML

These are just some ideas, we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.
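As one illustration of the actions listed above, adapting xlink:href values to a local convention could be sketched as a simple prefix rewrite on every mets:FLocat. The function name rewrite_hrefs, the prefix-based interface and the example URLs are assumptions; a real tool would probably need more flexible rules.

```python
import xml.etree.ElementTree as ET

NS = {'mets': 'http://www.loc.gov/METS/'}
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'


def rewrite_hrefs(root, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every FLocat xlink:href; return the number of changes."""
    changed = 0
    for flocat in root.findall('.//mets:FLocat', NS):
        href = flocat.get(XLINK_HREF)
        if href and href.startswith(old_prefix):
            flocat.set(XLINK_HREF, new_prefix + href[len(old_prefix):])
            changed += 1
    return changed


# Hypothetical file entries: one local path to rewrite, one remote URL to leave alone
SAMPLE = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<mets:fileSec><mets:fileGrp USE="DEFAULT">'
    '<mets:file ID="f1"><mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/0001.tif"/></mets:file>'
    '<mets:file ID="f2"><mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/0002.tif"/></mets:file>'
    '</mets:fileGrp></mets:fileSec>'
    '</mets:mets>'
)
count = rewrite_hrefs(SAMPLE, 'OCR-D-IMG/', 'https://example.org/images/')
hrefs = [fl.get(XLINK_HREF) for fl in SAMPLE.iter('{http://www.loc.gov/METS/}FLocat')]
```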