OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
restrict fileGrp names more #746

Closed bertsky closed 2 years ago

bertsky commented 2 years ago

Although the spec requires fileGrp/@USE names to follow a very strict scheme, we have not enforced this in core (only the workspace validator checks it). However, if fileGrp names are left completely unrestricted, we run into follow-up problems. For example, because file IDs are normally derived from fileGrp names, some user choices unwittingly produce invalid METS:

element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP_0005' is not a valid value of the atomic type 'xs:ID'.
element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL_0006' is not a valid value of the atomic type 'xs:ID'.
...
element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': 'OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP_0005' is not a valid value of the atomic type 'xs:IDREF'.
element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': 'OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL_0006' is not a valid value of the atomic type 'xs:IDREF'.

I therefore suggest extending the check in add_file (https://github.com/OCR-D/core/blob/d9f660ee727c5235813e7f1534e26f2bebe483d3/ocrd_models/ocrd_models/ocrd_mets.py#L298-L299) to add_file_grp as well.
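
To illustrate the kind of check meant here, a minimal standalone sketch follows. It is not the actual code in ocrd_mets.py: the names `XS_ID_ASCII`, `is_valid_xs_id` and `check_file_grp_name` are hypothetical, and the pattern is a deliberately conservative ASCII-only approximation of xs:ID, which is stricter than the full XML name rules.

```python
import re

# Hypothetical pattern (an assumption, not taken from OCR-D core):
# an ASCII letter or underscore first, then ASCII letters, digits,
# underscore, dot or hyphen. This is a conservative subset of xs:ID.
XS_ID_ASCII = re.compile(r'[A-Za-z_][A-Za-z0-9_.-]*')


def is_valid_xs_id(name: str) -> bool:
    """Return True if the whole name matches the conservative pattern."""
    return XS_ID_ASCII.fullmatch(name) is not None


def check_file_grp_name(name: str) -> str:
    """Reject a fileGrp name before file IDs are derived from it."""
    if not is_valid_xs_id(name):
        raise ValueError(
            "Invalid syntax for mets:fileGrp/@USE %r (not an xs:ID)" % name
        )
    return name


if __name__ == "__main__":
    for candidate in (
        "OCR-D-IMG",
        "OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP",
        "OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL",
    ):
        print(candidate, is_valid_xs_id(candidate))
```

With a check like this applied when the fileGrp is created, the two names from the validation errors above would be rejected up front, rather than surfacing later as invalid mets:file/@ID and mets:fptr/@FILEID values.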

kba commented 2 years ago

Agreed.

We should probably also make this explicit in the spec: we do not strictly require the naming scheme (it is a SHOULD, not a MUST), but we should add that "mets:fileGrp/@USE MUST be a valid xs:ID".

bertsky commented 2 years ago

Indeed.