Open kba opened 6 years ago
After discussing this issue with @tboenig: Reading order is not represented within METS since it is a page-level datum.
However, we find examples of reading orders represented in METS, e.g., within the DDR-Presseportal:
<mets:div TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr>
      <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr>
      <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone3">
    <mets:fptr>
      <mets:area COORDS="186,1290,673,559" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block20" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone4">
    <mets:fptr>
      <mets:area COORDS="189,1864,658,145" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block21" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
</mets:div>
How can you represent document structure? <mets:file mimetype="application/tei+xml">...</mets:file>?
@kba Proposal for OCR-D purposes:
<mets:structMap TYPE="LOGICAL" /> is the place to represent document structure (i.e. all structural phenomena which may cross page boundaries).
<pc:ReadingOrder /> is the place to store page-internal reading order.
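To make the page-internal half of this proposal concrete, here is a minimal sketch of reading a pc:ReadingOrder back out of a PAGE file with the Python standard library. The sample document, the region IDs and the chosen PAGE namespace version are assumptions for illustration only; real files may use another schema release or nested groups, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE schema release; other releases use a different URI.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, heavily reduced PAGE document (region IDs are made up).
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page1.tif" imageWidth="1000" imageHeight="2000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="block19"/>
        <RegionRefIndexed index="0" regionRef="block18"/>
        <RegionRefIndexed index="2" regionRef="block20"/>
      </OrderedGroup>
    </ReadingOrder>
  </Page>
</PcGts>"""


def page_reading_order(page_xml):
    """Return region IDs of a flat OrderedGroup, sorted by their @index."""
    root = ET.fromstring(page_xml)
    refs = root.findall(
        ".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", PAGE_NS
    )
    ordered = sorted(refs, key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in ordered]


print(page_reading_order(PAGE_XML))
```

Sorting by @index rather than trusting document order keeps the result stable even if a tool wrote the references out of sequence.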
@tboenig We should update the guidelines asap.
@tboenig Push.
This is only awaiting the updated guidelines, right?
For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder. A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.
Possibly fixed by #154
superseded by #207, but unrelated AFAICS
For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder. A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.
Page-local reading order and structure are important both on their own and as contributors to document structure.
The latter (i.e. structure across pages like section boundaries and cross-refs/indexes) cannot be adequately represented in fileGrps, though. The only place for that is still the logical structMap IMHO. So far, we have two conventions for its representation:
1. mets:div with Strukturdatenset structural types, which are linked to the physical file structure via mets:structLink (i.e. only page-level granularity)
2. mets:area as exemplified above, allowing for direct references into page segments (either in the form of @COORDS or via idref-typed @BEGIN pointers into ALTO or PAGE segments)

The second convention is of course more powerful and general, but not as widely used.
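To illustrate what the mets:area convention buys us, here is a minimal sketch that walks a logical structMap and collects the idref-typed pointers into the OCR files, in document order. The fragment is a shortened copy of the DDR-Presseportal example above; the wrapping mets:structMap element (added so the namespace is declared) and the function name are my assumptions.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened copy of the example above; the structMap wrapper is an assumption.
METS_XML = """<mets:structMap xmlns:mets="http://www.loc.gov/METS/" TYPE="LOGICAL">
  <mets:div TYPE="article-part" ORDER="1" ID="article6-1">
    <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
      <mets:fptr>
        <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
      </mets:fptr>
      <mets:fptr>
        <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
      </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
      <mets:fptr>
        <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
      </mets:fptr>
      <mets:fptr>
        <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
      </mets:fptr>
    </mets:div>
  </mets:div>
</mets:structMap>"""


def segment_refs(mets_xml):
    """List (LABEL, FILEID, BEGIN) for every IDREF-typed area, in document order."""
    root = ET.fromstring(mets_xml)
    result = []
    for zone in root.iterfind(".//mets:div[@TYPE='article-zone']", METS_NS):
        for area in zone.iterfind("mets:fptr/mets:area[@BETYPE='IDREF']", METS_NS):
            result.append((zone.get("LABEL"), area.get("FILEID"), area.get("BEGIN")))
    return result


print(segment_refs(METS_XML))
```

With the mets:structLink convention the same lookup would stop at whole pages, since there is no segment ID to follow into ALTO or PAGE.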
In fact, it has been somewhat forgotten even in the context of newspaper digitization, as even DDB Zeitungsportal has shied away from adopting it so far, despite listing the recording of article structure as a task in its grant proposal (AP 6 p.10) and in its master planning (Tiefenerschließung Artikelebene, p. 20). The latter document references ENMAP specifically, giving it a certain spin:
ENMAP is a METS/ALTO profile for newspapers that was developed by the Europeana Newspapers project and that contains, in particular, useful guidance for fine-grained structuring of the formal and content-related components of newspapers. Please note, however, that elaborate fine-grained structuring may yield added value only in local environments and cannot be reused in supra-regional discovery systems (e.g. DDB, Europeana).
So we can see there is a chicken-and-egg problem here: automatic structural tagging is still hard (although tools for visualizing and detecting article structure are getting better), hence enriched datasets are rare, and therefore training is difficult. Not having everyone commit to the existing, agreed-upon unified representation makes this even more difficult.
But it's not just a matter of simply adopting the ENMAP spec: IMO it is not trivially compatible with the DFG profile.
However this is resolved, I do think it is worth pursuing some form of documentation and specification already, as an enabler for tool developers and data providers.
(For example, we could simply write some OCR-D processor extracting OLR results with headings and reading order into "coarse" document structure in either DFG-profile / mets:structLink or ENMAP / mets:area form already.)
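A rough sketch of the ENMAP / mets:area half of that idea, just to show how little is needed on the output side. This is not an OCR-D processor and does not use the OCR-D API: the function name, the ID scheme and the input format (region IDs with labels, already in reading order) are all hypothetical.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)


def build_logical_div(div_id, file_id, regions):
    """Build a mets:div with one article-zone child per region.

    regions: list of (region_id, label) tuples, already in reading order.
    The TYPE values mirror the example earlier in this thread; the zone ID
    scheme is made up for this sketch.
    """
    parent = ET.Element(f"{{{METS}}}div", {"TYPE": "article-part", "ID": div_id})
    for number, (region_id, label) in enumerate(regions, start=1):
        zone = ET.SubElement(
            parent,
            f"{{{METS}}}div",
            {"TYPE": "article-zone", "LABEL": label, "ID": f"{div_id}-zone{number}"},
        )
        fptr = ET.SubElement(zone, f"{{{METS}}}fptr")
        # idref-typed pointer into the OCR file (ALTO or PAGE) for this segment
        ET.SubElement(
            fptr,
            f"{{{METS}}}area",
            {"BETYPE": "IDREF", "BEGIN": region_id, "FILEID": file_id},
        )
    return parent


div = build_logical_div(
    "article6-1", "alto1", [("block18", "title"), ("block19", "body")]
)
print(ET.tostring(div, encoding="unicode"))
```

The hard part stays upstream: deciding which regions belong to the same logical unit and what label they get.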
We need to specify how these constructs are related, which one to use, and how to handle contradictions.