OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

OcrdMets: add generateDS model of MODS as new OcrdMods class #931

Open bertsky opened 1 year ago

bertsky commented 1 year ago

For processors that consume MODS metadata, it would help to have a Python object model available, since that makes for easier and more efficient code. For example, querying language or script by XPath is painful.

The interface could be something like ocrd_mets.OcrdMets.dmdSec (as a dict of IDs to ocrd_mods.OcrdMods instances).
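
To make the idea concrete, here is a minimal sketch of a dict from dmdSec IDs to MODS wrapper objects. It uses only the standard library rather than a generateDS model, and the names `ModsRecord` and `dmd_sections` as well as the sample METS are hypothetical illustrations, not existing parts of ocrd_mets or ocrd_models.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Made-up sample document, only for demonstration
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="DMDLOG_0001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:recordInfo>
            <mods:recordIdentifier source="example">rec-42</mods:recordIdentifier>
          </mods:recordInfo>
          <mods:language>
            <mods:languageTerm type="code" authority="iso639-2b">ger</mods:languageTerm>
            <mods:scriptTerm type="code" authority="iso15924">Latf</mods:scriptTerm>
          </mods:language>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


class ModsRecord:
    """Hypothetical thin wrapper around one mods:mods element."""

    def __init__(self, element):
        self._el = element

    def _texts(self, path):
        # Collect the stripped text of every element matching the path
        return [e.text.strip() for e in self._el.findall(path, NS) if e.text]

    @property
    def languages(self):
        return self._texts("mods:language/mods:languageTerm")

    @property
    def scripts(self):
        return self._texts("mods:language/mods:scriptTerm")

    @property
    def record_identifier(self):
        ids = self._texts("mods:recordInfo/mods:recordIdentifier")
        return ids[0] if ids else None


def dmd_sections(mets_root):
    """Map each dmdSec ID to a ModsRecord; sections without MODS are skipped."""
    result = {}
    for sec in mets_root.findall("mets:dmdSec", NS):
        mods = sec.find("mets:mdWrap/mets:xmlData/mods:mods", NS)
        if mods is not None:
            result[sec.get("ID")] = ModsRecord(mods)
    return result


records = dmd_sections(ET.fromstring(SAMPLE_METS))
record = records["DMDLOG_0001"]
print(record.languages, record.scripts, record.record_identifier)
```

With something like this, a processor would read `record.languages` instead of writing its own XPath; a generated model would presumably cover the full MODS schema rather than these three fields.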

Remotely related: #783

kba commented 1 year ago

@bertsky in https://github.com/OCR-D/core/pull/966#pullrequestreview-1261544355 (posting here so it does not get lost when that discussion is resolved):

Moreover, what about MODS queries? ATM it's only a minor use-case (ocrd-segment-extract-lines wants to know the mods:recordIdentifier). But IIUC this will be the only way processors can query meta-data (whether passed from manual input or previous processors). So IMO we must (at some point, not necessarily right now) provide some OcrdMods and wrap that object via HTTP as well, e.g. in OcrdMets:

@property
def mods(self):
    return parsexml(...)

and then wrapping a /mods entry point in OcrdMetsServer and then in ClientSideOcrdMets:

@property
def mods(self):
    r = self.session.request('GET', f'{self.url}/mods')
    return r.json()
bertsky commented 1 year ago

Yes, and an OcrdMods would also be needed if we were to extend #698 (automatic inheritance in OcrdPage hierarchy) with the document-wide lang/script features.

bertsky commented 9 months ago

However, this could also be achieved via a dedicated (specialised) processor (which merely fills page-level lang/script from the MODS)...
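
A rough sketch of the core of such a processor, reduced to a plain function on two XML trees with the standard library only. The function name `fill_page_defaults`, the sample documents, and the two lookup tables are hypothetical; in particular, the mapping from MODS codes to PAGE attribute values is an assumption that would need checking against the PAGE schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
PAGE_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
PAGE_NS = {"pc": PAGE_URI}

# Hypothetical, deliberately tiny lookup tables (MODS code -> PAGE value);
# the right-hand values are assumptions to verify against the PAGE schema
LANGUAGE_NAMES = {"ger": "German", "eng": "English", "lat": "Latin"}
SCRIPT_NAMES = {"Latn": "Latn - Latin", "Latf": "Latf - Latin (Fraktur variant)"}

# Made-up sample documents, only for demonstration
SAMPLE_MODS = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm type="code" authority="iso639-2b">ger</mods:languageTerm>
    <mods:scriptTerm type="code" authority="iso15924">Latf</mods:scriptTerm>
  </mods:language>
</mods:mods>"""

SAMPLE_PAGE = f"""<pc:PcGts xmlns:pc="{PAGE_URI}">
  <pc:Page imageFilename="0001.tif" imageWidth="100" imageHeight="100"/>
</pc:PcGts>"""


def fill_page_defaults(mods_root, page_root):
    """Copy document-wide language/script onto the Page element if unset."""
    page = page_root.find("pc:Page", PAGE_NS)
    if page is None:
        return False
    changed = False
    for path, table, attr in (
        ("mods:language/mods:languageTerm", LANGUAGE_NAMES, "primaryLanguage"),
        ("mods:language/mods:scriptTerm", SCRIPT_NAMES, "primaryScript"),
    ):
        code = (mods_root.findtext(path, default="", namespaces=MODS_NS) or "").strip()
        # Only fill gaps: an existing page-level annotation is never overwritten
        if code in table and page.get(attr) is None:
            page.set(attr, table[code])
            changed = True
    return changed


page_root = ET.fromstring(SAMPLE_PAGE)
changed = fill_page_defaults(ET.fromstring(SAMPLE_MODS), page_root)
page = page_root.find("pc:Page", PAGE_NS)
print(changed, page.get("primaryLanguage"), page.get("primaryScript"))
```

The function only fills missing attributes, so page-level values set manually or by earlier processors would take precedence over the document-wide MODS defaults.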

bertsky commented 9 months ago

Valuable functionality that could be reused for OcrdMods can also be found in: