Note: In METS, the labels are a flat sequence of `gt:state` elements with `@prop` from the above mentioned schema file, one per page. These are then referenced under each physical structMap's page via `@DMDID`.

IMO in core we first need some additional API to support that. Like (in analogy to pageId):

```python
OcrdMets.get_gt_labelling(self, for_fileIds=None) # returns dict of file ID to label list
OcrdMets.get_gt_labelling_for_file(self, ocrd_file) # returns label list
OcrdMets.set_gt_labelling_for_file(self, labels, ocrd_file) # takes label list
# but also:
OcrdMets.add_file(self, ... labels=None, ...) # add full label list
OcrdMets.find_files(self, ... labels=None, ...) # filter by label list (match any)
```

What's your opinion, @kba?

Perhaps, instead of parsing this from the METS, we could also see to it that OCR-D mirrors them in the parsed PAGE-XML, i.e. `OcrdPage`. For example as:
This would make it easier to access the labels from a processor or PAGE viewer.
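To make the METS-side lookup described in the note concrete, here is a minimal sketch using only the Python standard library. It is not OCR-D core code. The standalone function `get_gt_labelling`, the `gt` namespace URI, the `mets:mdWrap`/`gt:gt` wrapper elements, the IDs and the `prop` values are all assumptions made for illustration. Only the `gt:state`/`@prop` elements and the `@DMDID` reference from the physical structMap's pages are taken from the description above.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    # Assumption: placeholder URI for the gt: prefix; check the schema file.
    "gt": "http://www.ocr-d.de/GT/",
}

# Hypothetical two-page METS excerpt; wrapper elements, IDs and prop values are made up.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                     xmlns:gt="http://www.ocr-d.de/GT/">
  <mets:dmdSec ID="DMD_0001">
    <mets:mdWrap MDTYPE="OTHER">
      <mets:xmlData>
        <gt:gt>
          <gt:state prop="example/label-a"/>
          <gt:state prop="example/label-b"/>
        </gt:gt>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:dmdSec ID="DMD_0002">
    <mets:mdWrap MDTYPE="OTHER">
      <mets:xmlData>
        <gt:gt>
          <gt:state prop="example/label-a"/>
        </gt:gt>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001" DMDID="DMD_0001">
        <mets:fptr FILEID="FILE_0001_GT"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002" DMDID="DMD_0002">
        <mets:fptr FILEID="FILE_0002_GT"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def get_gt_labelling(root, for_fileIds=None):
    """Return a dict of file ID to label list (hypothetical standalone helper)."""
    # Collect the label list of every dmdSec, keyed by its ID.
    labels_by_dmd = {
        dmd.get("ID"): [state.get("prop") for state in dmd.iterfind(".//gt:state", NS)]
        for dmd in root.iterfind("mets:dmdSec", NS)
    }
    labelling = {}
    for struct_map in root.iterfind("mets:structMap[@TYPE='PHYSICAL']", NS):
        for page in struct_map.iterfind(".//mets:div[@TYPE='page']", NS):
            # DMDID is of type IDREFS, so it may hold several space-separated IDs.
            labels = []
            for dmd_id in page.get("DMDID", "").split():
                labels.extend(labels_by_dmd.get(dmd_id, []))
            # Every file referenced by the page gets that page's labels.
            for fptr in page.iterfind("mets:fptr", NS):
                file_id = fptr.get("FILEID")
                if for_fileIds is None or file_id in for_fileIds:
                    labelling[file_id] = list(labels)
    return labelling


root = ET.fromstring(SAMPLE_METS)
print(get_gt_labelling(root))
print(get_gt_labelling(root, for_fileIds=["FILE_0002_GT"]))
```

A method on `OcrdMets` would presumably work on its already parsed tree instead of taking a `root` argument, and the `labels=` filter proposed for `find_files` could reuse the same dmdSec-to-labels mapping.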
Originally posted by @bertsky in https://github.com/hnesk/browse-ocrd/issues/36#issuecomment-1015224550