DiSSCo / openDS

The home of the open Digital Specimen (openDS) specification
Apache License 2.0

Combine work of OpenDS and TDWG Attribution IG #68

Open diatomsRcool opened 1 year ago

diatomsRcool commented 1 year ago

The following are emails related to this issue:

From Anne Thessen 15 Nov 2023

Hello! I've been reviewing my notes from TDWG and reading about the details of OpenDS, including the GitHub repo. This, of course, is very relevant to the work of the TDWG/RDA attribution working group, and you use these recommendations in OpenDS. The TDWG Attribution Interest Group was initially working on a DwC agent extension, but in light of OpenDS, David and I agree that this is not the way to go. We think that supporting attribution and provenance within OpenDS would be more beneficial.

After reviewing the OpenDS GitHub repo, I have the following question/suggestion for agents.json: would you like to use a controlled vocabulary for role? Might I suggest the Contributor Role Ontology? David and I have already added some biodiversity-relevant roles, and these are also reflected in VIVO. The TDWG Attribution IG could focus on building out the CRO to support the needs of OpenDS and DiSSCo. If you agree, we can refocus this IG on this task.

An unrelated question: is there any interest in representing the mappings (in the harmonization folder) in SSSOM?

From David Shorthouse 15 Nov 2023

Anne,

Thanks for renewing dialog here & my apologies for being silent. A couple things of note here, should they be useful to consider.

David

From Sharif Islam 15 Nov 2023

Dear Anne and David,

Thanks for your comment. I've removed Alex from the thread as he is no longer involved with DiSSCo. The openDS GitHub repository is in need of some attention and housekeeping soon. I've also added Sam, our lead developer, to the thread.

If you don't mind, could you please submit an issue on the openDS GitHub repository for this topic? This will allow us to track and follow the conversation there.

I agree that this is an excellent opportunity to align our efforts and use the momentum we have now with the DiSSCo development work, as well as the new GBIF data model. We have also looked into nanopub as part of the annotation data model.

Just to provide some background and context:

For the mapping aspect, we are exploring SSSOM in another project (with Claus Weiland), and the MIDS group is also investigating mapping (see TDWG abstract: https://doi.org/10.3897/biss.7.112672). So yes, we would be interested in representing the mapping as SSSOM.

For the GitHub issue, to assist us in understanding feature requirements, alignment, and other priorities, it would be useful if you could structure it as follows:

regards,

--sharif

From Sam Leeflang 24 Nov 2023

Hi Anne and David,

Thanks for your interest! In addition to Sharif's response, I have to say we are still working on the model of the agent. As I mentioned during the TDWG presentation, openDS is an adoption of the GBIF Unified Model, so most of the credit goes to Tim Robertson and John Wieczorek. What we did within DiSSCo for openDS was to make an adaptation of this very broad model, focussing specifically on specimen data. We also tried to simplify it a bit (by denormalizing parts) and created a JSON Schema variant of it. This was mainly done for record-by-record processing, for which the highly normalised model gets in the way. So far this exercise has worked out satisfactorily, and it looks like we will pursue this implementation.

Regarding the agent object, this requires a bit more thought in my opinion. The GBIF Unified Model splits the agent into three different tables: AgentRole, Agent and AgentRelationship. Because we denormalized it, we combined the AgentRole and Agent into a single object. However, I think this object needs a couple of additional fields to help with the identification of the agent. Reading through the TDWG Attribution GitHub, I think a lot of ideas have already been put forward.
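To illustrate the denormalization described above, here is a minimal Python sketch that merges each AgentRole row with its Agent row into a single combined object. It is not the openDS schema: the `agentID` join key, the sample values, and the `denormalise` helper are assumptions made for illustration, and the other field names are used only as plausible examples.

```python
# Hypothetical normalised tables. "agentID" and the sample values are
# assumptions made for this sketch, not part of any published schema.
agents = [
    {"agentID": "a1", "agentType": "person", "preferredAgentName": "Agent A"},
]
agent_roles = [
    {"agentID": "a1", "agentRoleRole": "collector", "agentRoleOrder": 1},
    {"agentID": "a1", "agentRoleRole": "identifier", "agentRoleOrder": 1},
]


def denormalise(agents, agent_roles):
    """Merge each role row with its agent row into one combined object."""
    agents_by_id = {agent["agentID"]: agent for agent in agents}
    return [{**agents_by_id[role["agentID"]], **role} for role in agent_roles]


combined = denormalise(agents, agent_roles)
for obj in combined:
    print(obj)
```

Each combined object carries the agent's own fields again, which is convenient for record-by-record processing but repeats the agent data once per role.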

Circling back to your email:

Kind regards, Sam

From Mathias Dillen 28 Nov 2023

Hi Anne, David, Sharif, Sam, Wouter,

The proposed model for Agents in the new GBIF Common Model aligns quite well with the thinking behind the Agents Attribution extension to DwC. The new model also solves some of the problems with the extension by virtue of being more relational.

In the extension there was an additional layer of different Roles played by Agents within a single Action upon an Occurrence, but this was never fleshed out and seems to complicate matters unnecessarily. It would do away with some redundancy in the "AgentRole" table, as you can split off the agentRoleOrder and impose a hierarchy of action->roles. Other than that, the extension maps quite nicely to the proposed Agent Model:

identificationID: Unnecessary since Agents can be linked to any class, including Identifications. This was only needed because of the star schema.
startedAtTime: agentRoleBegan
endedAtTime: agentRoleEnded
displayOrder: agentRoleOrder
role: See above.
action: agentRoleRole
name: preferredAgentName
verbatimName: agentRoleAgentName
alternateName: Could be added to Agent table. We've used it for cases where Agents have been semantically enriched by connecting their names to PIDs. In such a case, you may have 3+ kinds of name strings: one from the source data, one parsed from the source data (e.g. a single name in a team enumeration) and one from the PID authority metadata. We used it for the latter of these three.
identifier: identifierValue from the Identifier table. This immediately supports multiple PIDs for a single agent, which was possible but messy in the DwC extension.
agentIdentifierType: identifierType from the Identifier table. Keeping this field from getting messy will be tricky given past experiences (e.g. the many different ways ORCIDs get published to GBIF).
agentType: agentType from the Agent table.
attributionRemarks: This one was not discussed in the DwC extension and is not covered in the GBIF model. We used it in the BiCIKL work to document some provenance data from the attribution process.
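For readers who want the mapping above in machine-usable form, it could be written as a simple Python lookup table along these lines. This is only a sketch: the dictionary name and the `translate` helper are hypothetical, terms without a stated target are set to `None`, and the flat structure ignores the fact that the target fields live in different tables of the GBIF model.

```python
# DwC Agents Attribution extension term -> field in the proposed GBIF agent
# model, as listed in the email above. None marks terms with no stated target.
DWC_TO_GBIF = {
    "identificationID": None,                 # unnecessary outside the star schema
    "startedAtTime": "agentRoleBegan",
    "endedAtTime": "agentRoleEnded",
    "displayOrder": "agentRoleOrder",
    "role": None,                             # extra role layer, never fleshed out
    "action": "agentRoleRole",
    "name": "preferredAgentName",
    "verbatimName": "agentRoleAgentName",
    "alternateName": None,                    # could be added to the Agent table
    "identifier": "identifierValue",          # Identifier table
    "agentIdentifierType": "identifierType",  # Identifier table
    "agentType": "agentType",                 # Agent table
    "attributionRemarks": None,               # not covered in the GBIF model
}


def translate(record):
    """Rename mapped keys; return unmapped keys separately (hypothetical helper)."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        target = DWC_TO_GBIF.get(key)
        if target is None:
            unmapped[key] = value
        else:
            mapped[target] = value
    return mapped, unmapped


mapped, unmapped = translate(
    {"name": "Agent A", "action": "collector", "attributionRemarks": "checked"}
)
print(mapped)
print(unmapped)
```

Returning the unmapped terms separately keeps values such as attributionRemarks visible instead of silently dropping them.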

I wonder about your combination of AgentRole and Agent into a single object. This brings it closer to what the DwC extension was doing, and will also raise some of its problems again, most notably potential bloat of redundant data in this Agent object. See the examples saved from a test with the DwC extension in GBIF-UAT at the end of this DiSSCo Prepare report (https://www.dissco.eu/wp-content/uploads/DiSSCo-Prepare-D5.4-Semantic-Enhancement-w-doi.pdf). If you have the same agent or overlapping teams of agents doing multiple actions to a specimen, you can quickly get to dozens of rows. In this example it was 50+, but that was exacerbated as there were multiple PIDs for these agents as well.
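The row growth described above can be sketched with a little arithmetic in Python. The numbers are hypothetical and not taken from the report, and the sketch assumes the simplest case in which every agent performs every action and has the same number of PIDs; both function names are made up for this illustration.

```python
def flat_row_count(n_agents, n_pids, n_actions):
    """Rows in a flat layout: one row per (agent, PID, action) combination."""
    return n_agents * n_pids * n_actions


def normalised_row_count(n_agents, n_pids, n_actions):
    """Rows when agents, identifiers and roles are kept in separate tables."""
    return n_agents + n_agents * n_pids + n_agents * n_actions


# Hypothetical specimen: 5 agents, 2 PIDs each, 5 actions each.
print(flat_row_count(5, 2, 5))        # 50
print(normalised_row_count(5, 2, 5))  # 40

# Adding one more PID per agent multiplies the flat layout but only adds
# one identifier row per agent in the normalised layout.
print(flat_row_count(5, 3, 5))        # 75
print(normalised_row_count(5, 3, 5))  # 45
```

In the flat layout the agent's name and type are also repeated on every row, so the redundancy is larger than the row counts alone suggest.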

That said, I don't know how far we are from widespread implementation and adoption of these new models. Darwin Core in practice still shows numerous interoperability problems and it is a much simpler structure than the GBIF and openDS models. Hence I still think it is valuable to have the DwC Agents Attribution extension supported, as there are currently no good alternatives within the world of Darwin Core to sharing detailed semantic enrichment of agents.

best regards,

Mathias