OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

allow global fptrs in structMap #142

Open bertsky opened 4 years ago

bertsky commented 4 years ago

Currently, we only specify how to describe the hierarchy of pages (represented by a set of files under mets:structMap/mets:div/mets:div) and their order. But nothing so far on logical structure across pages (table of contents, via mets:structMap TYPE="LOGICAL" and mets:smLink) or other global representations. For the latter METS offers adding a set of files (mets:fptrs) directly under the top mets:structMap/mets:div – this would allow pointing to other file formats representing the complete document on some level, e.g.:

These could each reside in dedicated mets:fileGrps, so (with a little work) even our notion of workspace processor (reading from input file groups and writing to output file groups) could be generalized to allow encapsulating such converters within normal OCR-D CLIs.

For example, schematically, we could have:

<mets:fileGrp USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="TIFF-input">
    <mets:FLocat .../>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0001">
    <mets:FLocat .../>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0002">
    <mets:FLocat .../>
  </mets:file>
  ...
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-TEI">
  <mets:file MIMETYPE="application/tei+xml" ID="TEI-output">
    <mets:FLocat .../>
  </mets:file>
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-PDF">
  <mets:file MIMETYPE="application/pdf" ID="PDF-output">
    <mets:FLocat .../>
  </mets:file>
</mets:fileGrp>
...
<mets:structMap TYPE="PHYSICAL">
  <mets:div ID="physroot" TYPE="physSequence">
    <mets:div ID=... ORDER=... TYPE="page">
      <mets:fptr FILEID="TIFF-split_0001"/>
      <mets:fptr FILEID="OCR-D-SEG-PAGE_0001"/>
      <mets:fptr .../>
    </mets:div>
    ...
    <mets:fptr FILEID="TIFF-input"/>
    <mets:fptr FILEID="TEI-output"/>
    <mets:fptr FILEID="PDF-output"/>
  </mets:div>
</mets:structMap>

Related: #40. (The issue of conventions for logical structure has also been touched by #80 and OCR-D/core#304 before.)

wrznr commented 4 years ago

@tboenig Pls. have a look!

kba commented 11 months ago

Even though they are related, I think we should separate the (more complex) issue of support for logical structMaps from that of document-wide files. The latter seems more urgent, especially after our discussions about the METS conventions at SLUB, ULBH and SBB.

Currently, adding a file to a workspace without specifying a page ID adds the file to the fileGrp but adds no entry to the structMap[@TYPE="PHYSICAL"]. This is inconsistent with how e.g. SLUB provides document-wide PDFs. So I think we should specify that a file added without page ID MUST be added with a mets:fptr before the first page of the physical structMap. And we should implement that behavior in core.
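To make the proposed rule concrete, here is a minimal sketch of inserting a document-wide mets:fptr before the first page div of the physical structMap. It uses only Python's standard xml.etree.ElementTree and is not the core implementation; the helper name add_global_fptr and the page ID PHYS_0001 are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}

# Hypothetical minimal METS; the page ID "PHYS_0001" is an assumption.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="physroot" TYPE="physSequence">
      <mets:div ID="PHYS_0001" ORDER="1" TYPE="page">
        <mets:fptr FILEID="TIFF-split_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def add_global_fptr(mets_root, file_id):
    """Insert a mets:fptr for file_id before the first page div (hypothetical helper)."""
    top_div = mets_root.find('mets:structMap[@TYPE="PHYSICAL"]/mets:div', NS)
    if top_div is None:
        raise ValueError("no physical structMap with a top-level div")
    fptr = ET.Element(f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    children = list(top_div)
    # Index of the first child div; if there is none, append at the end.
    index = next(
        (i for i, child in enumerate(children) if child.tag == f"{{{METS_NS}}}div"),
        len(children),
    )
    top_div.insert(index, fptr)
    return fptr


root = ET.fromstring(SAMPLE)
add_global_fptr(root, "PDF-output")
top = root.find('mets:structMap[@TYPE="PHYSICAL"]/mets:div', NS)
print([child.get("FILEID") or child.get("ID") for child in top])
# prints ['PDF-output', 'PHYS_0001']
```

Because each new fptr is inserted directly before the first div, repeated calls keep the document-wide pointers in the order they were added, all ahead of the pages.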

I agree on the METS structure you propose, with two exceptions:

<mets:fileGrp USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="TIFF-input">
    <mets:FLocat .../>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0001">
    <mets:FLocat .../>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0002">
    <mets:FLocat .../>
  </mets:file>
  ...
</mets:fileGrp>

I don't think it's a good idea to mix multi-page and single-page documents in the same fileGrp. It's not a problem if one provides an explicit page (range) because in that case, the files to be processed are based on the page-specific part of the physical structMap and would omit the TIFF-input file. But if no page range is given, every file in that file group would be processed, including the multi-page TIFF-input. So I think this should be in a separate fileGrp.
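The difference between the two selection modes can be sketched as follows. This is an illustration under stated assumptions, not how core selects files: the helpers files_in_group and files_for_pages and the page IDs PHYS_0001 and PHYS_0002 are hypothetical, and mets:FLocat is left out for brevity.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS mixing a multi-page file with its per-page splits.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/tiff" ID="TIFF-input"/>
      <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0001"/>
      <mets:file MIMETYPE="image/tiff" ID="TIFF-split_0002"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="physroot" TYPE="physSequence">
      <mets:fptr FILEID="TIFF-input"/>
      <mets:div ID="PHYS_0001" ORDER="1" TYPE="page">
        <mets:fptr FILEID="TIFF-split_0001"/>
      </mets:div>
      <mets:div ID="PHYS_0002" ORDER="2" TYPE="page">
        <mets:fptr FILEID="TIFF-split_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def files_in_group(root, use):
    """All file IDs of a fileGrp, regardless of page (no page range given)."""
    path = f'mets:fileSec/mets:fileGrp[@USE="{use}"]/mets:file'
    return [f.get("ID") for f in root.findall(path, NS)]


def files_for_pages(root, use, page_ids):
    """File IDs of a fileGrp that are referenced by the given page divs."""
    in_group = set(files_in_group(root, use))
    selected = []
    pages = root.findall(
        'mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', NS
    )
    for page in pages:
        if page.get("ID") in page_ids:
            for fptr in page.findall("mets:fptr", NS):
                if fptr.get("FILEID") in in_group:
                    selected.append(fptr.get("FILEID"))
    return selected


root = ET.fromstring(SAMPLE)
print(files_in_group(root, "OCR-D-IMG"))
# prints ['TIFF-input', 'TIFF-split_0001', 'TIFF-split_0002']
print(files_for_pages(root, "OCR-D-IMG", {"PHYS_0001", "PHYS_0002"}))
# prints ['TIFF-split_0001', 'TIFF-split_0002']
```

With a page selection, the lookup goes through the page divs and never sees TIFF-input; without one, the whole file group is returned and the multi-page file is included.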

<mets:structMap TYPE="PHYSICAL">
  <mets:div ID="physroot" TYPE="physSequence">
    <mets:div ID=... ORDER=... TYPE="page">
      <mets:fptr FILEID="TIFF-split_0001"/>
      <mets:fptr FILEID="OCR-D-SEG-PAGE_0001"/>
      <mets:fptr .../>
    </mets:div>
    ...
    <mets:fptr FILEID="TIFF-input"/>
    <mets:fptr FILEID="TEI-output"/>
    <mets:fptr FILEID="PDF-output"/>
  </mets:div>
</mets:structMap>

Minor thing but I think the files should be at the top, before the first mets:div[@TYPE="page"] for consistency with how e.g. SLUB handles this case already.

bertsky commented 11 months ago

Even though they are related, I think we should separate the (more complex) issue of support for logical structMaps from that of document-wide files.

Agreed. One at a time.

Currently, adding a file to a workspace without specifying a page ID adds the file to the fileGrp but adds no entry to the structMap[@TYPE="PHYSICAL"].

Right. But you can do it with the API (workspace.add_file) or CLI (workspace add) in a processor. (E.g. ocrd-segment-evaluate does it. Even bashlib-based processors could, but ocrd-pagetopdf uses Python as well.)

On the other hand, there is so far no API for adding anything to the logical structMap (if you are so inclined). You can of course try to traverse the existing structMap and insert elements at the right level via lxml, but that is (1) hard to implement correctly (there are lots of different cases for the top levels, depending on the item type) and (2) hard to read, share and maintain.

So, we have:

The question of course is: which one is correct (for lack of a better word)?

On that point @M3ssman and I already had an interesting discussion in digital-derivans:

Another problem: the PDF gets referenced as fptr in the logical structMap. That's plain wrong according to DFG profile – it should be in the physical structMap.

What DFG profile do you mean?

I meant the DFG profile for METS. But now that I went looking, surprisingly I cannot find any specifics for PDF in there, except for the mention of the dedicated DOWNLOAD fileGrp.

It did enter the OCR-D spec on METS though. There it says to use fptr in the top-level div of the physical structMap.

Looking at the code base for DFG Viewer, Kitodo.Presentation, it appears that both are supported: fptr under physical and fptr under logical.

I am somewhat perplexed. How come this important detail never entered any official documentation?

Now, if pointing to already existing spec language on the subject does not ring the alarm for you, I don't know what to say. Somehow, this entire discussion (to which admittedly I did contribute, along with @tboenig) ran completely in parallel, and it never actually affected core (there's not even an issue tracking new behaviour like FULLDOWNLOAD_*).

Thus IMHO we should first discuss whether we really want the existing formulation (esp. around the FULLDOWNLOAD_* identifiers, which as it stands are not even compatible with what SLUB does and seem quite restrictive), and where we stand on physical vs. logical for global files.

Finally, addressing your last points...

I don't think it's a good idea to mix multi-page and single-page documents in the same fileGrp.

Why not? A fileGrp for PDFs could have both single-page files and a global file (this is what SLUB does), likewise for TEI. Or take evaluation reports (where you usually want to have both single-page and aggregated views).

I feel like we should try not to put too much of our own assumptions in here. If SLUB already does that kind of thing, perhaps others made the same choices.

It's not a problem if one provides an explicit page (range) because in that case, the files to be processed are based on the page-specific part of the physical structMap and would omit the TIFF-input file. But if no page range is given, every file in that file group would be processed, including the multi-page TIFF-input. So I think this should be in a separate fileGrp.

Ok, but that's a subordinate problem. Sure, for our processors, it must be crystal clear how this case should be handled. But PDF and TEI are never on the input side of a processor. And if we do manage to find a solution for the ambiguity of global vs. single-page files on the input side (say, via an option for the processor implementor), then we will also be able to accept multi-page TIFF input directly. (Just saying.)

Minor thing but I think the files should be at the top, before the first mets:div[@TYPE="page"] for consistency with how e.g. SLUB handles this case already.

Agreed.