Combine work of OpenDS and TDWG Attribution IG

The following are emails related to this issue:

From Anne Thessen 15 Nov 2023

Hello! I've been reviewing my notes from TDWG and reading about the details of OpenDS, including the GitHub repo. This, of course, is very relevant to the work of the TDWG/RDA attribution working group and you use these recommendations in OpenDS. The TDWG Attribution Interest Group was initially working on a DwC agent extension, but in light of OpenDS, David and I agree that this is not the way to go. We think that supporting attribution and provenance within OpenDS would be more beneficial. After reviewing the OpenDS GitHub repo, I have the following question/suggestion: For the agents.json -> Would you like to use a controlled vocabulary for role? Might I suggest the Contributor Role Ontology? David and I have already added some biodiversity relevant roles and these are also reflected in VIVO. The TDWG Attribution IG could focus on building out the CRO to support the needs of OpenDS and DiSSCo. If you agree, we can refocus this IG on this task.

An unrelated question: is there any interest in representing the mappings (in the harmonization folder) in SSSOM?

From David Shorthouse 15 Nov 2023

Anne,

Thanks for renewing dialog here & my apologies for being silent. A couple things of note here, should they be useful to consider.

the agent actions extension to DwC, originally envisioned to be an extension in a DwC-A: https://github.com/tdwg/attribution/issues. The primary motivation here was to tease apart the verb-based actions implicit in terms like dwc:recordedBy in support of a menu of many others such that an agent could be uniquely represented as having participated in the handling & trajectory of any one material sample. So, you'll see a vocabulary of terms, https://github.com/tdwg/attribution/issues?q=is%3Aissue+is%3Aopen+label%3Aaction-vocabulary that could do with a formal home in a Contributor Role Ontology if the latter can represent these equivalent interests. Mathias Dillen @ Meise Botanic Garden has shown renewed interest in the thinking behind this extension & so I have included him here.
elements of GBIF's proposed unified data model in relation to attribution may need refinement as a more "unified" superset with accommodation for CRO &/or verb-based actions, https://www.gbif.org/new-data-model. Webinars and case studies are available on the same page, but I have not examined each of these in sufficient detail to offer suggestions.
other places where there is a potential foothold for attribution are nanopublications (see https://blog.pensoft.net/2023/09/12/nanopublications-tailored-to-biodiversity-data/) or even the W3C Web Annotations for which its "motivations" can likewise be considered verb-based actions (see Matthis' & Andreas' preprint: https://doi.org/10.3897/arphapreprints.e114920).

David

From Sharif Islam 15 Nov 2023

Dear Anne and David,

Thanks for your comment. I've removed Alex from the thread as he is no longer involved with DiSSCo. The openDS GitHub repository is in need of some attention and housekeeping soon. I've also added Sam, our lead developer, to the thread.

If you don't mind, could you please submit an issue on the openDS GitHub repository for this topic? This will allow us to track and follow the conversation there.

I agree that this is an excellent opportunity to align our efforts and use the momentum we have now with the DiSSCo development work, as well as the new GBIF data model. We have also looked into nanopub as part of the annotation data model.

Just to provide some background and context:

We do a significant amount of harmonisation and mapping before creating the Digital Specimen record, which is essentially applying the openDS data model. This also includes creating a PID for each of these records. Incorporating the Contributor Role Ontology, TDWG attribution concept, and agent action should be feasible.
You can view a JSON representation of our current implementation (https://dev.dissco.tech/api/v1/specimens/TEST/Y9E-KGH-YZH/full). The annotation block includes "creator" and "generator" fields, following the W3C Annotation data model. If necessary, this can accommodate additional attribution terms. The GUI version is available at https://dev.dissco.tech/ds/TEST/Y9E-KGH-YZH. The Annotation data model went through a review process through the RFC. You might be interested in reviewing that as well.
Currently, Annotation (by both machine and human agents) is one of the features we are actively working on.

For the mapping aspect, we are exploring SSSOM in another project (with Claus Weiland), and the MIDS group is also investigating mapping (see TDWG abstract: https://doi.org/10.3897/biss.7.112672). So yes we would be interested in representing the mapping as SSSOM.

For the GitHub issue, to assist us in understanding feature requirements, alignment, and other priorities, it would be useful if you could structure it as follows:

what is the goal of the requested feature?
are there any preconditions/assumptions (such as use of existing ontologies, ORCID integration)?
if possible, provide a simple example or a basic workflow

regards,

--sharif

From Sam Leeflang 24 Nov 2023

Hi Anne and David,

Thanks for your interest! In addition to Sharif's response, I have to say we are still working on the model of the agent. As I mentioned during the TDWG presentation, openDS is an adoption of the GBIF Unified Model, so most of the credit goes to Tim Robertson and John Wieczorek. What we did within DiSSCo for openDS was to make an adaptation of this very broad model, focussing specifically on specimen data. We also tried to simplify it a bit (by denormalizing parts) and created a JSON Schema variant of it. This was mainly done for record-by-record processing, for which the highly normalised model gets in the way. So far this exercise has worked out satisfactorily, and it looks like we will pursue this implementation.

Regarding the agent object, this requires a bit more thought in my opinion. The GBIF Unified Model splits the agent into three different tables: AgentRole, Agent and AgentRelationship. Because we denormalized it, we combined the AgentRole and Agent into a single object. However, I think this object needs a couple of additional fields to help with the identification of the agent. Reading through the TDWG Attribution GitHub, I think a lot of ideas have already been put forward.

Circling back to your email:

It would be great if the TDWG Attribution group could give input/feedback on the agent object, which has been proposed by GBIF and adopted by DiSSCo in openDS. Would it make sense to set up a meeting together with Tim / John to discuss this?
We would very much like a controlled vocabulary for Role and the suggestions you are making seem very relevant.
Regarding capturing the data mapping in SSSOM: this is a valid remark, and in the future we might capture this in an SSSOM mapping. For now, as we are still very much in flux, the code base is the main point of documentation. For example, we increased the total terms to around 170, with hundreds of mappings, mainly to ABCD(EFG) and DwC. Once we have stabilized the mappings, we will work towards comprehensive documentation. We did do an SSSOM exercise for MIDS (levels 0, 1 and 2), on which Elspeth gave a presentation during TDWG. This SSSOM mapping can be found here: https://docs.google.com/spreadsheets/d/1ydNC8DHnrAPhPhTEQ7RmAztKJjJQCOiyX1wYl0SbFkU/edit?usp=sharing
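
To make the SSSOM idea concrete, here is a minimal sketch of writing term mappings as an SSSOM-style TSV table. It uses four core SSSOM columns (subject_id, predicate_id, object_id, mapping_justification). The subject term `ods:hypotheticalCollector` is a made-up placeholder and not an actual openDS term, and a real SSSOM file would normally also carry a metadata header with a CURIE map, which is omitted here.

```python
import csv
import io

# Core SSSOM columns; a full SSSOM file usually has more (and a metadata header).
SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_tsv(mappings):
    """Serialize a list of mapping dicts as a tab-separated SSSOM-style table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for mapping in mappings:
        writer.writerow(mapping)
    return buffer.getvalue()


# Hypothetical example row: the subject term is a placeholder, not a real openDS term.
example_mappings = [
    {
        "subject_id": "ods:hypotheticalCollector",
        "predicate_id": "skos:closeMatch",
        "object_id": "dwc:recordedBy",
        "mapping_justification": "semapv:ManualMappingCuration",
    }
]

sssom_tsv = to_sssom_tsv(example_mappings)
print(sssom_tsv)
```

Keeping the mappings in a flat table like this would let them be reviewed outside the code base, which remains the main point of documentation for now.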

Kind regards, Sam

From Mathias Dillen 28 Nov 2023

Hi Anne, David, Sharif, Sam, Wouter,

The proposed model for Agents in the new GBIF Common Model aligns quite well with the thinking behind the Agents Attribution extension to DwC. The new model also solves some of the problems with the extension by virtue of being more relational.

In the extension there was an additional layer of different Roles played by Agents within a single Action upon an Occurrence, but this was never fleshed out and seems to complicate matters unnecessarily. It would do away with some redundancy in the "AgentRole" table, as you can split off the agentRoleOrder and impose a hierarchy of action->roles. Other than that, the extension maps quite nicely to the proposed Agent Model:

identificationID: Unnecessary since Agents can be linked to any class, including Identifications. This was only needed because of the star schema.
startedAtTime: agentRoleBegan
endedAtTime: agentRoleEnded
displayOrder: agentRoleOrder
role: See above.
action: agentRoleRole
name: preferredAgentName
verbatimName: agentRoleAgentName
alternateName: Could be added to Agent table. We've used it for cases where Agents have been semantically enriched by connecting their names to PIDs. In such a case, you may have 3+ kinds of name strings: one from the source data, one parsed from the source data (e.g. a single name in a team enumeration) and one from the PID authority metadata. We used it for the latter of these three.
identifier: identifierValue from the Identifier table. This immediately supports multiple PIDs for a single agent, which was possible but messy in the DwC extension.
agentIdentifierType: identifierType from the Identifier table. Keeping this field from getting messy will be tricky given past experiences (e.g. the many different ways ORCIDs get published to GBIF).
agentType: agentType from the Agent table.
attributionRemarks: This one was not discussed in the DwC extension and is not covered in the GBIF model. We used it in the BiCIKL work to document some provenance data from the attribution process.
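
The term-by-term correspondence above can be sketched as a simple lookup table. This is only an illustration under stated assumptions: the dictionary holds the pairs listed above, the function name `translate` and the sample record are hypothetical, and terms without a direct counterpart (identificationID, role, alternateName, attributionRemarks) are returned separately rather than dropped.

```python
from typing import Any, Dict, Tuple

# DwC agent extension term -> field name in the proposed GBIF agent model,
# taken from the list above. Only terms with a direct counterpart are included.
EXTENSION_TO_GBIF: Dict[str, str] = {
    "startedAtTime": "agentRoleBegan",
    "endedAtTime": "agentRoleEnded",
    "displayOrder": "agentRoleOrder",
    "action": "agentRoleRole",
    "name": "preferredAgentName",
    "verbatimName": "agentRoleAgentName",
    "identifier": "identifierValue",
    "agentIdentifierType": "identifierType",
    "agentType": "agentType",
}


def translate(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split an extension record into mapped fields and terms with no counterpart."""
    mapped: Dict[str, Any] = {}
    leftover: Dict[str, Any] = {}
    for term, value in record.items():
        target = EXTENSION_TO_GBIF.get(term)
        if target is None:
            leftover[term] = value  # e.g. attributionRemarks, alternateName
        else:
            mapped[target] = value
    return mapped, leftover


# Hypothetical extension record, for illustration only.
sample = {
    "name": "Example Collector",
    "action": "collected",
    "attributionRemarks": "matched manually",
}
mapped_fields, unmapped_terms = translate(sample)
print(mapped_fields)
print(unmapped_terms)
```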

I wonder about your combination of AgentRole and Agent into a single object. This brings it closer to what the DwC extension was doing, and will also raise some of its problems again, most notably potential bloat of redundant data in this Agent object. See the examples saved from a test with the DwC extension in GBIF-UAT at the end of this DiSSCo Prepare report: https://www.dissco.eu/wp-content/uploads/DiSSCo-Prepare-D5.4-Semantic-Enhancement-w-doi.pdf. If you have the same agent or overlapping teams of agents doing multiple actions to a specimen, you can quickly get to dozens of rows. In this example it was 50+, but that was exacerbated as there were multiple PIDs for these agents as well.
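
A rough sketch of how this multiplication arises, assuming every agent takes part in every action and each PID yields its own row in a flat structure. The agent names, PIDs and actions below are invented placeholders and do not reproduce the data in the report.

```python
from itertools import product


def denormalized_rows(agents, actions):
    """One flat row per (agent, action, PID) combination.

    agents: dict mapping agent name -> list of PIDs for that agent
    actions: list of action names; every agent is assumed to perform each one
    """
    rows = []
    for (name, pids), action in product(agents.items(), actions):
        # An agent without any PID still produces a single row.
        for pid in pids or [None]:
            rows.append({"name": name, "identifier": pid, "action": action})
    return rows


# Placeholder data: 3 agents with 3 PIDs each, taking part in 6 actions.
agents = {
    "Agent A": ["pid:a1", "pid:a2", "pid:a3"],
    "Agent B": ["pid:b1", "pid:b2", "pid:b3"],
    "Agent C": ["pid:c1", "pid:c2", "pid:c3"],
}
actions = ["collected", "identified", "prepared", "imaged", "transcribed", "georeferenced"]

flat = denormalized_rows(agents, actions)
print(len(flat))  # 3 agents * 6 actions * 3 PIDs = 54 rows
```

In a relational layout each PID would be stored once per agent rather than once per agent and action, which is where the saving comes from.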

That said, I don't know how far we are from widespread implementation and adoption of these new models. Darwin Core in practice still shows numerous interoperability problems, and it is a much simpler structure than the GBIF and openDS models. Hence I still think it is valuable to have the DwC Agents Attribution extension supported, as there are currently no good alternatives within the world of Darwin Core for sharing detailed semantic enrichment of agents.

best regards,

Mathias
