DiSSCo / openDS

The home of the open Digital Specimen (openDS) specification
Apache License 2.0

Information of ID type would be good. #135

Open snsb-seifert opened 2 weeks ago

snsb-seifert commented 2 weeks ago

Suggestion

In some places external IDs are used, e.g. in dwc:scientificNameID and dwc:acceptedNameUsageID. Is it possible to specify the type of each ID? Otherwise these IDs cannot be used programmatically, because the information about the issuing system is missing.

Even the examples at https://dwc.tdwg.org/list/#dwc_acceptedNameUsageID need to add information about the identifier, e.g. (GBIF) or (COL). This information should be available as a separate ID type, or the ID should be in a resolvable format (e.g. a URL).

samleeflang commented 2 weeks ago

Hi Stefan,

Thanks for your suggestion! Let me see if I fully understand your comment. When an identifier is added, you would like to know which system the identifier belongs to. So instead of just the term "dwc:taxonID": "BRKJG", you would like to see that this identifier comes from the Catalogue of Life (COL). This way, machine agents would know what to do with the BRKJG identifier. A good point, and we don't have any option for that (yet). It would potentially mean adding an additional term for each term that might contain an identifier, and as we follow Darwin Core in most of these places, it would be better to add these to Darwin Core.

What we do have is this: we try to create an EntityRelationship for each identifier that points to an external system. So in addition to storing the "dwc:taxonID": "BRKJG" in the specimen, we also create an EntityRelationship:

{
  "@type": "ods:EntityRelationship",
  "dwc:relationshipOfResource": "hasColID",
  "dwc:relatedResourceID": "https://www.catalogueoflife.org/data/taxon/BRKJG",
  "dwc:relationshipEstablishedDate": "2024-08-23T13:01:44.099Z",
  "ods:RelationshipAccordingToAgent": {
    "@id": "https://hdl.handle.net/TEST/123-123-123",
    "@type": "as:Application",
    "schema:name": "dissco-nusearch-service",
    "ods:hasIdentifier": []
  },
  "dwc:relationshipAccordingTo": "dissco-nusearch-service"
}

This relationship says that this specimen in DiSSCo's infrastructure has a relationship with a resource in another infrastructure. It indicates that the relationship was established at a specific date by an agent, in this case the DiSSCo name usage search service. The EntityRelationship also contains the relatedResourceID with the resolvable URL of the related resource. I think these EntityRelationships are what a machine will be interested in, and they could allow you to combine data from different sources.
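To illustrate how a machine agent might use such a relationship, here is a minimal Python sketch. It assumes the specimen record carries its relationships in a list under a key named "ods:hasEntityRelationships"; that key name and the helper function `related_resource_url` are assumptions made for this example and are not taken from the specification. The relationship values are copied from the example above.

```python
from typing import Optional

# Hypothetical specimen fragment. The "ods:hasEntityRelationships" key is an
# assumption for this sketch; the relationship values mirror the example above.
specimen = {
    "dwc:taxonID": "BRKJG",
    "ods:hasEntityRelationships": [
        {
            "@type": "ods:EntityRelationship",
            "dwc:relationshipOfResource": "hasColID",
            "dwc:relatedResourceID": "https://www.catalogueoflife.org/data/taxon/BRKJG",
        }
    ],
}


def related_resource_url(record: dict, relationship: str) -> Optional[str]:
    """Return the relatedResourceID of the first matching relationship, if any."""
    for rel in record.get("ods:hasEntityRelationships", []):
        if rel.get("dwc:relationshipOfResource") == relationship:
            return rel.get("dwc:relatedResourceID")
    return None


col_url = related_resource_url(specimen, "hasColID")
print(col_url)
```

The lookup depends on the consumer already knowing what "hasColID" means, so it only works if the relationship names are defined somewhere a machine can look them up.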

What do you think, are we taking the right approach here or would you do it differently?

Kind regards, Sam

snsb-seifert commented 2 weeks ago

Hi Sam, you understood me correctly. My approach would have been to narrow the usage of these terms, in the context of openDS, to something resolvable (a URL) or to a more strictly defined identifier such as an LSID.

In your example we learn that there is a relation "hasColID". But where is the meaning of hasColID defined? It could be that this can be retrieved from the agent description via the given DOI. How do we know whether this relation stands for, e.g., the scientificNameID or the acceptedNameUsageID? Both might be given for one Specimen (or MaterialEntity).

In dwc:scientificNameID, the context of what this ID is about is clearly given by the term definition. Maybe there is a terminology which defines "hasColID" to be the dwc:acceptedNameUsageID? Or a registry which states that the output of "dissco-nusearch-service" is a direct match to acceptedNameUsageID?

Best regards, Stefan

wouteraddink commented 2 weeks ago

Hi Stefan, thanks for your comment. If we were to add additional terms to describe an identifier type, these would probably need to be openDS terms, as there is no DwC equivalent. There would then be a few options:

  1. describe an identifier as being either RESOLVABLE, GLOBAL or LOCAL, or
  2. describe them on a more detailed level: LSID, ARK, PURL, ISBN, or
  3. describe them on an even more detailed level, like COL ID, CETAF ID, GrSciColl ID instead of PURL.

Describing a type is most useful for non-resolvable GUIDs, e.g. ISBN, LSID, Crossref Funder ID; otherwise there is no way of knowing what the ID represents unless you can deduce it from the string. To make that actionable by machines it would need to be a controlled vocabulary, but that is difficult to establish as the number of ID types is growing fast.

PIDs like RORs, ORCIDs and ARKs should always be provided as a full URL, and then it is not so important to know whether it is a ROR or an ORCID, as you could get that information by resolving the PID. So if we add a term, I would propose to go for option 1. We already do that for the primarySpecimenObjectId in the PID record, where we have primarySpecimenObjectIdType. We could also add a name for the identifier as a free-text string, which is especially useful for local identifiers. We also do that in the PID record for the primarySpecimenObjectId. Let's say you have an identifier 123a; then it would help if you have a locally used name for it, say "registration number".
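As a rough illustration of option 1, the following Python sketch sorts an identifier string into RESOLVABLE, GLOBAL or LOCAL. The function name `classify_identifier` and the heuristics are assumptions made for this example: an http(s) URL counts as RESOLVABLE, a string matching one of a few known non-resolvable patterns (here only an LSID prefix and an unhyphenated ISBN) counts as GLOBAL, and everything else counts as LOCAL. It is not a proposed vocabulary or an existing openDS rule.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only, not a controlled vocabulary.
GLOBAL_PATTERNS = [
    re.compile(r"^urn:lsid:", re.IGNORECASE),   # LSID prefix
    re.compile(r"^(97[89])?\d{9}[\dXx]$"),      # ISBN-10 or ISBN-13, no hyphens
]


def classify_identifier(value: str) -> str:
    """Heuristically classify an identifier as RESOLVABLE, GLOBAL or LOCAL."""
    parsed = urlparse(value)
    # A full http(s) URL with a host is treated as resolvable.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "RESOLVABLE"
    # Known globally unique but non-resolvable identifier shapes.
    if any(pattern.match(value) for pattern in GLOBAL_PATTERNS):
        return "GLOBAL"
    # Anything else cannot be told apart from a local identifier.
    return "LOCAL"


for example in ("https://www.catalogueoflife.org/data/taxon/BRKJG", "BRKJG", "123a"):
    print(example, classify_identifier(example))
```

Note that the bare string "BRKJG" ends up as LOCAL here, because nothing in the string itself reveals the issuing system. That is the gap an explicit type term or a free-text identifier name would close.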