OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

Relation of METS and PAGE ReadingOrder #40

Open kba opened 6 years ago

kba commented 6 years ago

We need to specify how these constructs are related, which one to use, how to handle contradictions.

kba commented 6 years ago

cf. https://github.com/OCR-D/spec/issues/55

wrznr commented 6 years ago

After discussing this issue with @tboenig: Reading order is not represented within METS since it is a page-level datum.

wrznr commented 6 years ago

However, we find examples of reading orders represented in METS, e.g., within the DDR-Presseportal:

<mets:div TYPE="article-part" ORDER="1" ID="article6-1">
    <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
        <mets:fptr>
            <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
        <mets:fptr>
            <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone3">
        <mets:fptr>
            <mets:area COORDS="186,1290,673,559" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block20" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone4">
        <mets:fptr>
            <mets:area COORDS="189,1864,658,145" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block21" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
</mets:div>
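
For readers who want to see what consuming such a fragment could look like, here is a minimal sketch in Python using only the standard library. It assumes that the sibling order of the article-zone divisions is the intended reading order (the fragment only carries an explicit ORDER on the article-part). The sample is a shortened copy of the fragment with the METS namespace declared so that it parses on its own. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened copy of the fragment above; the xmlns declaration is added
# here only so that the snippet parses stand-alone.
SAMPLE = """
<mets:div xmlns:mets="http://www.loc.gov/METS/" TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr><mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/></mets:fptr>
    <mets:fptr><mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/></mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr><mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/></mets:fptr>
    <mets:fptr><mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/></mets:fptr>
  </mets:div>
</mets:div>
"""


def zones_in_document_order(article_part):
    """Return (label, file ID, block ID) per article zone, in sibling order.

    Assumption: sibling order of the zones equals reading order.
    """
    zones = []
    for zone in article_part.findall("mets:div[@TYPE='article-zone']", METS_NS):
        # Only follow the pointers into the OCR file (BETYPE="IDREF"),
        # not the coordinate-based pointers into the image.
        for area in zone.findall("mets:fptr/mets:area[@BETYPE='IDREF']", METS_NS):
            zones.append((zone.get("LABEL"), area.get("FILEID"), area.get("BEGIN")))
    return zones


article_zones = zones_in_document_order(ET.fromstring(SAMPLE))
for label, file_id, block_id in article_zones:
    print(label, file_id, block_id)
```
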
kba commented 6 years ago

How would you represent document structure, then? As <mets:file MIMETYPE="application/tei+xml">...</mets:file>?

cneud commented 6 years ago

This was also a topic in Europeana Newspapers. See e.g.
http://www.primaresearch.org/publications/ICDAR2013_Clausner_ReadingOrder
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D5.3_Final_release_ENMAP_1.0.pdf

wrznr commented 5 years ago

@kba Proposal for OCR-D purposes: <mets:structMap TYPE="LOGICAL" /> is the place to represent document structure (i.e. all structural phenomena which may cross page boundaries). <pc:ReadingOrder /> is the place to store page-internal reading order.
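
To illustrate the page-internal half of this proposal, here is a minimal sketch in Python (standard library only) that reads the region sequence from a pc:ReadingOrder. It assumes the PAGE 2019-07-15 namespace and a single flat pc:OrderedGroup. Nested groups and pc:UnorderedGroup are not handled. The sample page and the function name are made up for illustration, and the regions omit their coordinates for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019 namespace; older files use a different version date.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up sample page; regions lack Coords, so it is not schema-valid.
SAMPLE_PAGE = """
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <pc:ReadingOrder>
      <pc:OrderedGroup id="ro1">
        <pc:RegionRefIndexed index="1" regionRef="r_body"/>
        <pc:RegionRefIndexed index="0" regionRef="r_title"/>
      </pc:OrderedGroup>
    </pc:ReadingOrder>
    <pc:TextRegion id="r_title" type="heading"/>
    <pc:TextRegion id="r_body" type="paragraph"/>
  </pc:Page>
</pc:PcGts>
"""


def page_reading_order(pcgts):
    """Return region IDs sorted by the index of their RegionRefIndexed."""
    group = pcgts.find("pc:Page/pc:ReadingOrder/pc:OrderedGroup", PAGE_NS)
    if group is None:
        return []  # page carries no (ordered) reading order
    refs = group.findall("pc:RegionRefIndexed", PAGE_NS)
    # The index attribute, not the sibling position, defines the order.
    refs = sorted(refs, key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in refs]


region_ids = page_reading_order(ET.fromstring(SAMPLE_PAGE))
print(region_ids)
```
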

wrznr commented 5 years ago

@tboenig We should update the guidelines asap.

wrznr commented 5 years ago

@tboenig Push.

cneud commented 5 years ago

This is only awaiting the updated guidelines, right?

#80 is closed, and I fully agree with https://github.com/OCR-D/spec/issues/40#issuecomment-421994713.

For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.

A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

kba commented 4 years ago

Possibly fixed by #154

bertsky commented 2 years ago

> Possibly fixed by #154

superseded by #207, but unrelated AFAICS

> For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.
>
> A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

Page-local reading order and structure is important both on its own, and as a contributor to document structure.

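The latter (i.e. structure across pages, such as section boundaries, cross-references and indexes) cannot be adequately represented in fileGrps, though. The only place for that is still the logical structMap IMHO. So far, we have two conventions for its representation:

1. the DFG profile, which links logical divisions to physical pages via mets:structLink
2. ENMAP, which points logical divisions at page regions via mets:area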

The second convention is of course more powerful and general, but not as widely used.

In fact, it has been somewhat forgotten even in the context of newspaper digitization: even the DDB Zeitungsportal has shied away from adopting it so far, despite listing the recording of article structure as a task in its grant proposal (AP 6, p. 10) and in its master planning (Tiefenerschließung Artikelebene, p. 20). The latter document references ENMAP specifically, giving it a certain spin:

ENMAP is a METS/ALTO profile for newspapers that was developed by the Europeana Newspapers project and that, in particular, contains useful guidance for fine-grained structuring of the formal and content-related components of newspapers. Please note, however, that elaborate fine-grained structuring may yield added value only in local environments and cannot be reused in supra-regional discovery tools (e.g. DDB, Europeana).

So we can see there is a chicken-and-egg problem here: automatic structural tagging is still hard (although tools for visualizing and detecting article structure are getting better), hence enriched datasets are rare, and therefore training is difficult. Not having everyone commit to the existing, agreed-upon unified representation makes this even more difficult.

But it's not just a matter of simply adopting the ENMAP spec: IMO it is not trivially compatible with the DFG profile.

However this is resolved, I do think it is worth pursuing some form of documentation and specification already, as an enabler for tool developers and data providers.

(For example, we could simply write some OCR-D processor extracting OLR results with headings and reading order into "coarse" document structure in either DFG-profile / mets:structLink or ENMAP / mets:area form already.)
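
A very rough sketch of what the mets:area variant of such an extraction could look like, in Python with the standard library only. It is not an OCR-D processor and uses no OCR-D API. The input format (per-page lists of region IDs with a type, already in reading order), the function name, the TYPE values and the generated IDs are all hypothetical. The output merely mirrors the mets:fptr / mets:area BETYPE="IDREF" pattern from the fragment quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)


def coarse_struct_map(pages):
    """Build a logical structMap in which every heading opens a new section.

    pages: list of (file_id, [(region_id, region_type), ...]), with the
    regions of each page already in reading order (hypothetical input).
    """
    struct_map = ET.Element(f"{{{METS}}}structMap", TYPE="LOGICAL")
    root = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="document")
    section = None
    for file_id, regions in pages:
        for region_id, region_type in regions:
            # A heading (or the very first region) starts a new section;
            # a section simply continues across page boundaries.
            if region_type == "heading" or section is None:
                section_id = f"sec{len(root) + 1}"
                section = ET.SubElement(
                    root, f"{{{METS}}}div", TYPE="section", ID=section_id
                )
            fptr = ET.SubElement(section, f"{{{METS}}}fptr")
            ET.SubElement(
                fptr, f"{{{METS}}}area",
                FILEID=file_id, BETYPE="IDREF", BEGIN=region_id,
            )
    return struct_map


# Made-up example: the first section runs from page 1 onto page 2.
example_pages = [
    ("page1", [("r1", "heading"), ("r2", "paragraph")]),
    ("page2", [("r3", "paragraph"), ("r4", "heading"), ("r5", "paragraph")]),
]
struct_map = coarse_struct_map(example_pages)
print(ET.tostring(struct_map, encoding="unicode"))
```

A DFG-profile variant would instead emit plain logical divisions and connect them to physical page divisions through mets:structLink, which is why the two forms are not trivially interchangeable.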