OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

Storing information on used processor parameters in the METS and PAGE #108

Open kba opened 5 years ago

kba commented 5 years ago

How do we encode that a specific component used a specific model?

For example, if ocrd_tesserocr uses the frm model, where do we store that information?

This should be part of the overall provenance model, but we do not have that in place yet and it's unclear if and how module projects would interact with it.

Options I can see are:

1) as part of the mets:agent for that component
2) as a pg:MetadataItem in the PAGE-XML document

mets:agent

 <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="recognition/text-recognition">
   <mets:name>ocrd-tesserocr-recognize v0.1.3</mets:name>
   <mets:note>{"model-used": "frm"}</mets:note>
 </mets:agent>

mets:note isn't the most semantically meaningful of elements but mets:agent can only have mets:name and mets:note. Using JSON in a catch-all XML element is probably not ideal.
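
To make option 1 concrete, here is a minimal Python sketch that writes the parameters as JSON into mets:note and reads them back. It only uses the standard library; the helper names make_agent and read_params are hypothetical and not part of any OCR-D API.

```python
import json
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def make_agent(name, role, params):
    """Build a mets:agent whose mets:note carries the parameters as JSON.

    Hypothetical helper for illustration, not an OCR-D function.
    """
    agent = ET.Element(f"{{{METS_NS}}}agent", {
        "TYPE": "OTHER",
        "OTHERTYPE": "SOFTWARE",
        "ROLE": "OTHER",
        "OTHERROLE": role,
    })
    ET.SubElement(agent, f"{{{METS_NS}}}name").text = name
    # sort_keys keeps the serialization stable across runs
    ET.SubElement(agent, f"{{{METS_NS}}}note").text = json.dumps(params, sort_keys=True)
    return agent


def read_params(agent):
    """Parse the JSON payload of the agent's mets:note back into a dict."""
    note = agent.find("mets:note", {"mets": METS_NS})
    return json.loads(note.text)


agent = make_agent(
    "ocrd-tesserocr-recognize v0.1.3",
    "recognition/text-recognition",
    {"model-used": "frm"},
)
print(ET.tostring(agent, encoding="unicode"))
```

Any consumer has to know that the note contains JSON, which is exactly the weakness mentioned above: nothing in the XML itself says so.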

pg:MetadataItem

<pg:MetadataItem type="model-used" value="frm"/>

bertsky commented 5 years ago

Option 2 is problematic like this: the current schema does not allow arbitrary type strings – one would have to use type="other" name="model-used" value="frm" instead.

And another variant of option 2 is storing all runtime parameters of all previous processors/annotators in a pg:Labels sub-element, as can be seen here, BTW. For example:

<pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-tesserocr-recognize">
    <pc:Labels externalModel="parameters">
        <pc:Label value="glyph" type="textequiv_level"/>
        <pc:Label value="deu-frak" type="model"/>
    </pc:Labels>
</pc:MetadataItem>
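
As a rough sketch of how a processor could emit (and a consumer could read) this structure, assuming the parameters are available as a flat dict: the function names below are hypothetical, and the PAGE namespace version (2019-07-15) is an assumption that may differ from the one in your files.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE schema version; adjust to the namespace of the actual documents.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NSMAP = {"pc": PAGE_NS}
ET.register_namespace("pc", PAGE_NS)


def make_processing_step(step, processor, params):
    """Build a pc:MetadataItem with one pc:Label per runtime parameter.

    Hypothetical helper for illustration only.
    """
    item = ET.Element(f"{{{PAGE_NS}}}MetadataItem", {
        "type": "processingStep",
        "name": step,
        "value": processor,
    })
    labels = ET.SubElement(item, f"{{{PAGE_NS}}}Labels", {"externalModel": "parameters"})
    for key, value in params.items():
        # parameter name goes into @type, parameter value into @value
        ET.SubElement(labels, f"{{{PAGE_NS}}}Label", {"type": key, "value": str(value)})
    return item


def read_parameters(item):
    """Collect the pc:Label entries of a processing step into a dict."""
    return {
        label.get("type"): label.get("value")
        for label in item.iterfind("pc:Labels/pc:Label", NSMAP)
    }


item = make_processing_step(
    "recognition/text-recognition",
    "ocrd-tesserocr-recognize",
    {"textequiv_level": "glyph", "model": "deu-frak"},
)
print(ET.tostring(item, encoding="unicode"))
```

Note that all values end up as strings, so a reader cannot distinguish the number 1 from the string "1" without knowing the processor's parameter schema.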

kba commented 5 years ago

Right, I forgot that you already implemented this. The example above was just a basic draft.

I would be fine with formalizing your solution here in spec, so it can be consistently implemented across projects.

mikegerber commented 5 years ago

I was looking for this and https://github.com/OCR-D/spec/issues/108#issuecomment-503147346 looks good to me.

cneud commented 4 years ago

I am fine with what @bertsky proposed in https://github.com/OCR-D/spec/issues/108#issuecomment-503147346 too.

kba commented 4 years ago

Widely implemented but not defined in our spec. Should be remedied.

bertsky commented 2 years ago

Since https://github.com/OCR-D/core/pull/747, the METS part is also implemented in a general way (and should be standardized).

j-panzer commented 2 years ago

In our last "OCR(-D) & Co" call (1st July 2022) we talked about provenance and discussed the following ideas for documenting the creation path:

See also PR https://github.com/OCR-D/spec/pull/126 about Provenance.

bertsky commented 2 years ago

  • Currently, provenance information is stored in the mets:agent/mets:note elements, as well as in the ALTO XML Processing/processingStepSettings elements (kitodo_production_ocrd).

The latter is true only because our page-to-alto converter knows our convention for PAGE-XML provenance under MetadataItem/Labels, which is the primary information.

The problem here is that the information is spread over different elements and documents, and a re-run will create new/additional elements.

It is a form of redundancy, not of spread: Both the PAGE-XML and the METS-XML provenance are complete (as far as the workflow is concerned). However, naturally, in the PAGE-XML provenance, you cannot infer the fileGrp(s) where the input came from, whereas in the METS-XML provenance, you can.

And the observation about repeating provenance elements when repeating processors only applies to the METS-XML side. However, it is not a problem IMO. On the contrary, even when your provenance is in another format (like Nextflow), you'd need to know when workflows have been (partially) repeated.

  • Our idea in the call was to store a reference to the Nextflow workflow descriptions (e.g. in the mets:agent) or embed the description as part of the METS mets:fileGrp, to prove the genesis. This makes it possible to trace how the results came about.

I don't see how the currently implemented descriptive format is not traceable/machine-actionable. Also, I would argue against a format of any specific workflow engine like Nextflow: You can always run single processors, too. It should not be relevant how you ran the workflow, the provenance should always look the same. (Or, if you insist on the Nextflow format, then every standalone processor run should be wrapped as a single-step Nextflow workflow.)

Also, I can see why you would want to subsume the information under the mets:fileGrp instead of the global metsHdr, but that actually makes reading the workflow for humans much more difficult, and provides no additional information (since the fileGrp names are already included as input and output currently).

But perhaps a better way to represent this kind of information is in an amdSec/digiprovMD (with mdWrap or mdRef to a local workflow file). The actual data (be it Nextflow or our currently implemented XML-NS https://ocr-d.e ad-hoc schema) could contain everything we know about the run, including revisions, versions, dates and checksums.

  • And another idea was to also store the logs in this way (mets:fileGrp) to complete the picture

I agree, it would be beneficial to include the logs (following the single source of truth idea).

The problem here is size (so the logs should not be included verbatim but referenced as files), and granularity (some information is file-specific, some is fileGrp-specific and some even regards multiple fileGrps at the same time when there are multiple outputs – and processors do not emit that kind of information).

It might be tempting to just add another fileGrp (say LOGS) for this where each file entry contains an FLocat to the log output of that run. However, this might clash with METS usage outside of OCR-D. Also, it would be better to use a representation which implicitly ties workflow/provenance and logs – as you suggested.

So I would propose to use amdSec/digiprovMD for that as well: with mdRef to LOCTYPE=OTHER OTHERLOCTYPE=FILE hrefs of the local log files. This way, the same metadata element could describe both the workflow/run and its corresponding log results.
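
To illustrate what I mean, a minimal Python sketch that adds such an entry for one log file. Nothing like this is implemented: the function name, the ID scheme, the log path and the OTHERMDTYPE value are all made up for the example. (mdRef requires an MDTYPE, hence MDTYPE=OTHER here.)

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def add_log_reference(mets_root, run_id, log_path):
    """Append an amdSec/digiprovMD whose mdRef points to a local log file.

    Hypothetical helper: the ID scheme and OTHERMDTYPE value are assumptions.
    In a full METS file the amdSec must be placed before the fileSec; this
    sketch simply appends to an otherwise empty root.
    """
    amd = ET.SubElement(mets_root, f"{{{METS_NS}}}amdSec", {"ID": f"AMD_{run_id}"})
    prov = ET.SubElement(amd, f"{{{METS_NS}}}digiprovMD", {"ID": f"PROV_{run_id}"})
    ET.SubElement(prov, f"{{{METS_NS}}}mdRef", {
        "LOCTYPE": "OTHER",
        "OTHERLOCTYPE": "FILE",
        "MDTYPE": "OTHER",
        "OTHERMDTYPE": "LOG",  # assumed label, not defined anywhere in the spec
        f"{{{XLINK_NS}}}href": log_path,
    })
    return prov


root = ET.Element(f"{{{METS_NS}}}mets")
prov = add_log_reference(root, "0001", "logs/ocrd-tesserocr-recognize.log")
print(ET.tostring(root, encoding="unicode"))
```

A second mdRef (or an mdWrap) under the same digiprovMD could then carry the workflow description, which is what ties the run and its logs together.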

  • The NF WF descriptions should be kept in a repository for this purpose and are ideally versioned

So you mean the workflows should be described by reference only, not by value? That would be the first instance where we reference content outside the workspace directory in the METS.