put together the provenance info needed for each package/objects

kmexter commented 2 years ago

As far as I can remember, the common provenance model (CPM) of EOSC Life WP6 recommends that the following be provided for each file as a digital object:

who or what provided it (ORCID, project name and URL, and the software process that executed the "get", "create" or "update")
whether this was a "get", a "create", an "update", etc.
where it was obtained from (ideally URL(s)) or how it was created (e.g. by a software process)
its provenance at the place it was obtained from (e.g. a URL pointing to the metadata record, if the dataset was taken from one; and internally, if we e.g. merge files, also URLs pointing to the provenance files in GitHub that belong to the merged files)
licence and more general access rights
who controls it (i.e. who to contact about it)
modification remarks if this is an update (ideally with some of the remarks taken from a vocabulary, so that it is clear to machines as well as humans whether this is an "original copy", "updated data", or whatever)
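To make the list above concrete, here is a minimal sketch of how one such per-file record could be held in code. All field names, the `Action` vocabulary and the rule that an update needs a remark are my own assumptions for illustration; none of them are taken from the WP6 CPM itself.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List, Optional


class Action(str, Enum):
    """Hypothetical controlled vocabulary for the kind of operation."""
    GET = "get"
    CREATE = "create"
    UPDATE = "update"


@dataclass
class ProvenanceRecord:
    """One provenance record per file; field names are illustrative only."""
    provider_orcid: str            # who provided it
    project_name: str
    project_url: str
    software_process: str          # what executed the get/create/update
    action: Action
    source_urls: List[str] = field(default_factory=list)
    upstream_provenance_urls: List[str] = field(default_factory=list)
    licence: str = ""
    access_rights: str = ""
    contact: str = ""              # who controls it
    modification_remark: Optional[str] = None

    def __post_init__(self):
        # Accept plain strings such as "get" and normalise them to the enum.
        self.action = Action(self.action)
        # Assumed rule: an update must say what was changed.
        if self.action is Action.UPDATE and not self.modification_remark:
            raise ValueError("an 'update' record needs a modification remark")

    def to_dict(self):
        """Return a plain dict that can be dumped to JSON or YAML."""
        d = asdict(self)
        d["action"] = self.action.value
        return d
```

A record like this could be written next to each file in the repo and later translated into whatever serialisation we settle on.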

This provenance information can be packaged in a prov RO-Crate that we create, but can also be written in PROV-O following the CPM of WP6 (my notes about this can be found on Confluence: https://confluence.vliz.be/display/VMDCOS/2022-07-08+Vienna+ISO+pt+3+meeting and https://confluence.vliz.be/display/VMDCOS/Reading+on+provenance+in+marine+biology, plus 2 papers that I am not allowed to share digitally but which I have printed out and on my desk :-})
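As a rough idea of what the PROV-O side might look like, the sketch below builds a small JSON-LD graph using standard PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasGeneratedBy`, `prov:wasAttributedTo`, `prov:wasDerivedFrom`) plus `dcterms:license`. This is plain PROV-O only: it does not claim to follow the WP6 CPM profile, and the function name and its parameters are hypothetical.

```python
import json

# Namespace prefixes for PROV-O and Dublin Core terms.
CONTEXT = {
    "prov": "http://www.w3.org/ns/prov#",
    "dcterms": "http://purl.org/dc/terms/",
}


def to_prov_jsonld(file_url, agent_url, activity_id, source_urls, licence_url):
    """Describe one file as a prov:Entity generated by an activity.

    All arguments are IRIs supplied by the caller (assumed inputs).
    """
    sources = [{"@id": u} for u in source_urls]
    return {
        "@context": CONTEXT,
        "@graph": [
            {
                "@id": file_url,
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": activity_id},
                "prov:wasAttributedTo": {"@id": agent_url},
                "prov:wasDerivedFrom": sources,
                "dcterms:license": {"@id": licence_url},
            },
            {
                "@id": activity_id,
                "@type": "prov:Activity",
                "prov:wasAssociatedWith": {"@id": agent_url},
                "prov:used": sources,
            },
            {"@id": agent_url, "@type": "prov:Agent"},
        ],
    }


def dump(doc):
    """Serialise the graph as indented JSON text."""
    return json.dumps(doc, indent=2)
```

The resulting document could sit inside an RO-Crate as a separate provenance file, or be merged into the crate metadata, depending on what we decide.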

kmexter commented 2 years ago

Additionally, for biological material and its digital "derivatives" we will need provenance information that follows the EMBRC "provenance model" we are building in WP6. This will cover the metadata necessary for each spreadsheet from a single station/sampling event, the digital files (e.g. the sequences, the ARMS images), and also the biobanked material (especially if the stations don't do this properly!). Since Laurian and I will not have time to put this model together until Oct/Nov, I think this part of the provenance will have to wait until then. What we can perhaps do before then is decide how we will store these metadata. Ideally not as CSV files (data.csv and metadata.csv), because that is just too clunky for the amount of digital data that will need to be managed. We will need to create a template that can (ideally) be filled in automatically, which L and K can do as part of our EMBRC prov model work.

cedricdcc commented 2 years ago

This can be made into an action that can be applied to a GitHub repo; it doesn't matter whether the repo is an RO-Crate or not. @marc-portier thoughts?

Responsibility can be placed on the original author of the file, with their GH account as contact info.
License is the over-arching repo license.
Output format should be produced according to input params.
Search for existing actions that already generate the provenance from a given repo.
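A minimal sketch of the core of such an action, assuming it runs inside a checked-out repo: the original author is taken from the commit that first added the file, the licence is passed in once for the whole repo, and the output format is chosen by a parameter. The function names and record fields are hypothetical; the git author name and email stand in for the GH account, since mapping to a GitHub handle would need the GitHub API, which is not shown here.

```python
import csv
import io
import json
import subprocess

# git log limited to the commit(s) that added the file, following renames.
# %an = author name, %ae = author email, %aI = author date (strict ISO 8601).
GIT_LOG_ARGS = ["git", "log", "--follow", "--diff-filter=A",
                "--format=%an|%ae|%aI", "--"]

FIELDS = ["path", "author", "contact", "created", "licence"]


def git_log_for(path):
    """Run git for one file and return its raw output (needs a git repo)."""
    result = subprocess.run(GIT_LOG_ARGS + [path], capture_output=True,
                            text=True, check=True)
    return result.stdout


def parse_original_author(log_output):
    """Pick the oldest entry; git log prints newest first, so it is last."""
    lines = [ln for ln in log_output.splitlines() if ln.strip()]
    if not lines:
        return None
    # rsplit keeps a "|" inside the author name intact.
    name, email, date = lines[-1].rsplit("|", 2)
    return {"author": name, "contact": email, "created": date}


def build_record(path, log_output, repo_licence):
    """Combine the file path, its original author and the repo licence."""
    info = parse_original_author(log_output) or {
        "author": "", "contact": "", "created": ""}
    return {"path": path, **info, "licence": repo_licence}


def render(records, fmt="json"):
    """Serialise records in the format requested by the input parameter."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError(f"unsupported output format: {fmt}")
```

In a workflow this would be called once per tracked file, e.g. `build_record(p, git_log_for(p), "CC-BY-4.0")`, with the licence string being whatever the repo actually declares.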

laurianvm commented 2 years ago

prov-o link: https://www.w3.org/TR/prov-o/

kmexter commented 1 month ago

Still on my list of things to do, end of Nov.

emo-bon / governance-data

put together the provenance info needed for each package/objects #7