Open kba opened 5 years ago
Option 2 is problematic like this: the current schema does not allow arbitrary type
strings – one would have to use type="other" name="model-used" value="frm"
instead.
And another variant of option 2 is storing all runtime parameters of all previous processors/annotators in a pg:Labels
sub-element, as can be seen here, BTW. For example:
<pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-tesserocr-recognize">
  <pc:Labels externalModel="parameters">
    <pc:Label value="glyph" type="textequiv_level"/>
    <pc:Label value="deu-frak" type="model"/>
  </pc:Labels>
</pc:MetadataItem>
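To illustrate how a consumer could read this convention back, here is a minimal sketch in Python using only the standard library. The namespace URI (PAGE 2019-07-15), the wrapping `pc:Metadata` element and the function name `processing_steps` are assumptions made for the example; they are not taken from the spec or from any OCR-D API.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE namespace version 2019-07-15; adjust to the version in use.
PC = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PC}

# The MetadataItem from the example above, wrapped in pc:Metadata for parsing.
SAMPLE = f"""
<pc:Metadata xmlns:pc="{PC}">
  <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-tesserocr-recognize">
    <pc:Labels externalModel="parameters">
      <pc:Label value="glyph" type="textequiv_level"/>
      <pc:Label value="deu-frak" type="model"/>
    </pc:Labels>
  </pc:MetadataItem>
</pc:Metadata>
"""


def processing_steps(metadata):
    """Return one dict per processingStep MetadataItem, in document order."""
    steps = []
    for item in metadata.findall("pc:MetadataItem[@type='processingStep']", NS):
        params = {}
        # Only Labels groups marked as parameters are read here.
        for labels in item.findall("pc:Labels[@externalModel='parameters']", NS):
            for label in labels.findall("pc:Label", NS):
                # Per the convention above: type = parameter name, value = its value.
                params[label.get("type")] = label.get("value")
        steps.append({
            "step": item.get("name"),
            "processor": item.get("value"),
            "parameters": params,
        })
    return steps


steps = processing_steps(ET.fromstring(SAMPLE))
print(steps)
```

Since every processor appends its own `MetadataItem`, the returned list would reflect the order of processing steps recorded in that PAGE-XML file.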
Right, I forgot that you already implemented this. The example above was just a basic draft.
I would be fine with formalizing your solution here in spec, so it can be consistently implemented across projects.
I was looking for this and https://github.com/OCR-D/spec/issues/108#issuecomment-503147346 looks good to me.
I am fine with what @bertsky proposed in https://github.com/OCR-D/spec/issues/108#issuecomment-503147346 too.
Widely implemented but not defined in our spec. Should be remedied.
Since https://github.com/OCR-D/core/pull/747, the METS part is also implemented in a general way (and should be standardized).
In our last "OCR(-D) & Co" call (1st July 2022) we talked about provenance and discussed the following ideas for documenting the creation path:
See also PR https://github.com/OCR-D/spec/pull/126 about Provenance.
- Currently, provenance information is stored in the mets:agent/mets:note elements, as well as in the ALTO XMLProcessing/processingStepSettings elements (kitodo_production_ocrd).
The latter is true only because our page-to-alto converter knows our convention for PAGE-XML provenance under MetadataItem/Labels, which is the primary information.
The problem here is that the information is spread over different elements and documents, and a re-run will create new/additional elements.
It is a form of redundancy, not of spread: Both the PAGE-XML and the METS-XML provenance are complete (as far as the workflow is concerned). However, naturally, in the PAGE-XML provenance, you cannot infer the fileGrp(s) where the input came from, whereas in the METS-XML provenance, you can.
And the observation about repeating provenance elements when repeating processors only applies to the METS-XML side. However, it is not a problem IMO. On the contrary, even when your provenance is another format (like Nextflow), you'd need to know when workflows have been (partially) repeated.
- Our idea in the call was to store a reference to the Nextflow workflow descriptions (e.g. in the mets:agent) or to embed the description as part of the METS mets:fileGrp, to prove the genesis. This makes it possible to trace how the results came about.
I don't see how the currently implemented descriptive format is not traceable/machine-actionable. Also, I would argue against a format of any specific workflow engine like Nextflow: You can always run single processors, too. It should not be relevant how you ran the workflow; the provenance should always look the same. (Or, if you insist on the Nextflow format, then every standalone processor run should be wrapped as a single-step Nextflow workflow.)
Also, I can see why you would want to subsume the information under the mets:fileGrp instead of the global metsHdr, but that actually makes reading the workflow for humans much more difficult, and provides no additional information (since the fileGrp names are already included as input and output currently).
But perhaps a better way to represent this kind of information is in an amdSec/digiprovMD (with mdWrap or mdRef to a local workflow file). The actual data (be it Nextflow or our currently implemented XML-NS https://ocr-d.e ad-hoc schema) could contain everything we know about the run, including revisions, versions, dates and checksums.
- Another idea was to also store the logs in this way (mets:fileGrp) to complete the picture
I agree, it would be beneficial to include the logs (following the single source of truth idea).
The problem here is size (so the logs should not be included verbatim but referenced as files), and granularity (some information is file-specific, some is fileGrp-specific and some even regards multiple fileGrps at the same time when there are multiple outputs – and processors do not emit that kind of information).
It might be tempting to just add another fileGrp (say LOGS) for this where each file entry contains an FLocat to the log output of that run. However, this might clash with METS usage outside of OCR-D. Also, it would be better to use a representation which implicitly ties workflow/provenance and logs – as you suggested.
So I would propose to use amdSec/digiprovMD for that as well: with mdRef to LOCTYPE=OTHER OTHERLOCTYPE=FILE hrefs of the local log files. This way, the same metadata element could describe both the workflow/run and its corresponding log results.
- The NF WF descriptions should be kept in a repository for this purpose and are ideally versioned
So you mean the workflows should be described by reference only, not by value? That would be the first instance where we reference content outside the workspace directory in the METS.
How do we encode that a specific component used a specific model?
For example, ocrd_tesserocr uses the frm model; where do we store that information?
This should be part of the overall provenance model, but we do not have that in place yet and it's unclear if and how module projects would interact with it.
Options I can see are:
1) as part of the mets:agent for that component
2) as a pg:MetadataItem in the PAGE-XML document
mets:agent
mets:note isn't the most semantically meaningful of elements, but mets:agent can only have mets:name and mets:note. Using JSON in a catch-all XML element is probably not ideal.
pg:MetadataItem