FAIRsFAIR / FAIRSemantics

#P-Rec. 1: Use Globally Unique, Persistent and Resolvable Identifier for Semantic Artefacts, their content and their versions #1

Open ghost opened 4 years ago

ghost commented 4 years ago

Description: Semantic artefacts are typically structured text files. They are de facto digital objects and should be unambiguously identified by globally unique, persistent and resolvable identifiers (GUPRI). In the context of a web of FAIR data, these identifiers should be resolvable and support the retrieval of both the semantic artefact itself and also its metadata (see Rec. 2 regarding metadata). As shown in fig. 1, semantic artefacts are composite digital objects requiring at least three levels of identifiers: one for the semantic artefact itself, one for its content and one for the metadata (including both the global metadata and the metadata associated with the content). The metadata level is described in the following recommendation (Rec. 2). Finally, semantic artefacts are living digital objects by nature, evolving over time. A further GUPRI should therefore be added to track the different versions of a semantic artefact, giving access to the latest version but also to previous versions still in use in existing information systems.

As Web-based documents, semantic artefacts are usually identified by globally unique (i.e. two different files cannot have the same identifier) and resolvable identifiers. On the Web, a semantic artefact is usually represented by two key URIs: the URI pointing to the file and the URI namespace of the semantic artefact. As an example, consider a semantic artefact hosted on GitHub which has a local namespace that points to the content of the artefact. The same artefact then carries two different URIs, which goes against the principle of uniqueness of the identifier. To solve this issue, the namespace and the file URI can be joined through HTTP redirects. This does not address the issue of persistence, however. To cope with both issues, the Web community developed the concept of Persistent URLs (PURLs)[1] and implemented dedicated servers guaranteeing the persistence of the URL and of any associated HTTP redirects. The value of this approach has been recognised in the biomedical domain by the OBO Foundry, which explicitly recommends PURLs for identifying semantic artefacts in its ID policy (see Existing recommendations). The approach showed its limits in 2016, however, when the central PURL server was shut down for lack of funding. Fortunately, the system was then taken over by a longer-lived organisation, the Internet Archive[2]. Finally, the Industrial Ontology Foundry recommends using IRIs (which allow Unicode in web addresses) that are registered in its system.
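To make the redirect idea concrete, here is a minimal sketch of how a persistent URL can stay stable while the file behind it moves. It only simulates a redirect table in memory; all URLs (`purl.example.org`, `github.example.org`, `repo.example.org`) and the `resolve` helper are hypothetical and not taken from any real PURL service.

```python
def resolve(url, redirects, max_hops=5):
    """Follow entries in a redirect table until no further redirect exists."""
    seen = {url}
    for _ in range(max_hops + 1):
        target = redirects.get(url)
        if target is None:
            return url  # nothing left to follow: this is the current location
        if target in seen:
            raise ValueError(f"redirect loop at {target}")
        seen.add(target)
        url = target
    raise ValueError("too many redirects")


# Hypothetical persistent namespace URI and its current file location.
NAMESPACE = "https://purl.example.org/myonto"
redirects = {NAMESPACE: "https://github.example.org/lab/myonto/raw/main/myonto.owl"}
print(resolve(NAMESPACE, redirects))

# The file moves to another host: only the table changes, not the identifier.
redirects[NAMESPACE] = "https://repo.example.org/ontologies/myonto.owl"
print(resolve(NAMESPACE, redirects))
```

The identifier that users cite is always `NAMESPACE`; persistence then depends on whoever maintains the redirect table, which is exactly the dependency that the 2016 PURL server episode exposed.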

Another way to implement GUPRIs is to use Persistent IDs (PIDs) based on the Handle System[3]. A handle is a Web-based identifier made of a prefix, which identifies a "naming authority", and a suffix, which gives the "local name" of a resource. It is resolved through a handle server, which gives access to the associated metadata by redirecting to the landing page of the record, intended for human consumption. This approach is currently being investigated and promoted by the scientific data community (RDA, EOSC, …). A particular kind of handle, the DOI, could be used where the identifier should support citations (see Rec. 17). However, a limitation of the DOI is that it only refers to the landing page, which is a dead end for machines. Another limitation of PIDs compared to URLs/URIs is that practitioners have less control: PIDs are assigned by international organisations, which charge a fee for minting new PIDs. In a sense this business model fosters the longevity of the identifiers. However, it does require a dedicated service to mint and assign new PIDs.
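The prefix/suffix structure described above can be sketched as follows. This is an illustration only: the handle and DOI values are made up, the helper names are hypothetical, and the sketch assumes identifiers without characters that would need URL encoding. It relies on the public proxy resolvers `https://hdl.handle.net/` and `https://doi.org/`, and on the convention that DOI prefixes start with `10.`.

```python
def split_handle(handle):
    """Split a handle into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return prefix, suffix


def resolver_url(handle):
    """Build the proxy URL through which the identifier can be resolved."""
    prefix, _ = split_handle(handle)
    # DOIs are handles whose prefix starts with "10."
    base = "https://doi.org/" if prefix.startswith("10.") else "https://hdl.handle.net/"
    return base + handle


# Made-up identifiers, for illustration only.
print(split_handle("20.500.99999/myonto-v1"))
print(resolver_url("20.500.99999/myonto-v1"))
print(resolver_url("10.99999/myonto.2021"))
```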

As discussed in the introduction, these identification systems should apply to the semantic artefact but also to its content. Indeed, semantic artefacts can be considered as datasets of concepts and relations. Therefore, in this context, each element of the semantic artefact should also have an associated GUPRI. Both the OBO Foundry and the Industrial Ontology Foundry propose special conventions for defining URI-based identifiers (see BP-Rec. 1 and BP-Rec. 2).
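As an illustration of such a convention, the sketch below builds and parses content-level identifiers following the OBO-style pattern `http://purl.obolibrary.org/obo/{IDSPACE}_{LOCALID}`, as I understand the OBO Foundry identifier policy. The regular expression is a simplification (letters-only ID space, numeric local ID), and the helper names and the example term are chosen purely for illustration.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"
# Simplified pattern: alphabetic ID space, underscore, numeric local ID.
TERM_RE = re.compile(r"^" + re.escape(OBO_BASE) + r"([A-Za-z]+)_(\d+)$")


def term_iri(idspace, local_id):
    """Build a content-level identifier for one element of an artefact."""
    return f"{OBO_BASE}{idspace}_{local_id}"


def parse_term_iri(iri):
    """Return (idspace, local_id), or raise if the IRI does not fit the pattern."""
    match = TERM_RE.match(iri)
    if match is None:
        raise ValueError(f"not an OBO-style term IRI: {iri!r}")
    return match.group(1), match.group(2)


iri = term_iri("ENVO", "00000428")
print(iri)
print(parse_term_iri(iri))
```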

Finally, a unified identifier schema should be used to identify each version of semantic artefact. This can be done using versioned URIs, as proposed by OBO Foundry. Using GUPRI for the different version allows information systems to retrieve automatically the latest version and older versions of the semantic artefact.
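A small sketch of what versioned URIs can look like, assuming the OBO-style release pattern `http://purl.obolibrary.org/obo/{onto}/releases/{YYYY-MM-DD}/{onto}.owl`. The ontology name `myonto`, the release dates and the helper names are hypothetical. Note that picking the latest version only works here because the code knows the scheme and parses the date out of the IRI.

```python
import re
from datetime import date

VERSION_RE = re.compile(
    r"^http://purl\.obolibrary\.org/obo/(?P<onto>[a-z0-9]+)"
    r"/releases/(?P<day>\d{4}-\d{2}-\d{2})/(?P=onto)\.owl$"
)


def version_iri(onto, release):
    """Build a version-specific IRI from an ontology name and a release date."""
    return (
        f"http://purl.obolibrary.org/obo/{onto}"
        f"/releases/{release.isoformat()}/{onto}.owl"
    )


def release_date(iri):
    """Extract the release date embedded in a version IRI."""
    match = VERSION_RE.match(iri)
    if match is None:
        raise ValueError(f"not a dated release IRI: {iri!r}")
    return date.fromisoformat(match.group("day"))


def latest(iris):
    """Pick the most recent version by comparing the embedded dates."""
    return max(iris, key=release_date)


versions = [
    version_iri("myonto", date(2021, 5, 14)),
    version_iri("myonto", date(2020, 1, 15)),
]
print(latest(versions))
```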

This recommendation emphasizes the need for reliable and persistent identification systems without any technical constraints.

Related recommendations:

  - W3C Data on the Web - Best Practice 9: Use persistent URIs as identifiers of datasets[4]
  - OBO Foundry - Principle 3[5]
  - OBO Foundry - Identifier Policy[6]
  - OBO Foundry - Principle 4[7]
  - Industrial Ontology Foundry - Principle 11: IRI and identifier space
  - Industrial Ontology Foundry - Principle 12: Identifier and naming conventions
  - EOSC PID policy recommendation (Hellström et al., 2019)

Stakeholders: Practitioner and Repository

alko-k commented 3 years ago

For NVS my interpretation is:

  1. Artefact URI: http://vocab.nerc.ac.uk/collection/A05/current/
  2. Content URI: each of the concepts in A05, like http://vocab.nerc.ac.uk/collection/A05/current/EV_AIRHUM/
  3. Version URI: http://vocab.nerc.ac.uk/collection/A05/current/EV_AIRHUM/1/
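The three levels in this interpretation can be told apart by path depth alone. The sketch below is inferred only from the three example URIs in this comment, not from NVS documentation, so the depth-to-level mapping and the `classify` helper are assumptions.

```python
from urllib.parse import urlsplit

# Assumed mapping from number of path segments to identifier level.
LEVELS = {3: "artefact", 4: "content", 5: "version"}


def classify(uri):
    """Classify an NVS-style URI as artefact, content or version identifier."""
    parts = [p for p in urlsplit(uri).path.split("/") if p]
    if not parts or parts[0] != "collection" or len(parts) not in LEVELS:
        raise ValueError(f"unrecognised URI layout: {uri!r}")
    return LEVELS[len(parts)]


for uri in (
    "http://vocab.nerc.ac.uk/collection/A05/current/",
    "http://vocab.nerc.ac.uk/collection/A05/current/EV_AIRHUM/",
    "http://vocab.nerc.ac.uk/collection/A05/current/EV_AIRHUM/1/",
):
    print(classify(uri), uri)
```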

jonquet commented 3 years ago

AgroPortal (CC @EamdouniGIT)

AgroPortal supports identification with both URI and GUID. It also assigns each ontology an acronym identifier which is unique but not universal (e.g., ENVO, PPO, AGROVOC). This acronym (stored in omv:acronym) is usually chosen by the author or by AgroPortal's administration, following the best practices of the community concerned. The AgroPortal ontology metadata model offers two properties for storing the URI (omv:uri) and an additional external identifier such as a PURL or a DOI (dct:identifier). It also allows storing a version-specific URI (owl:versionIRI). AgroPortal's metadata model does not yet offer a property to store the GUID of the metadata if the metadata are not included in the ontology file.

Here is an example of an ontology with the 3 ids: http://data.agroportal.lirmm.fr/ontologies/EOL/latest_submission?include=URI,identifier&apikey=528c4e4a-5c3e-4798-a2e2-11d96761b8ce

Or in the UI: http://agroportal.lirmm.fr/ontologies/EOL/

mehdiabbasi commented 3 years ago

There are URIs for ICES vocabularies:

  1. List of Code Types https://vocab.ices.dk/services/rdf/collection/SHIPC/26D1
  2. List of Codes in specific CodeType https://vocab.ices.dk/services/rdf/collection/SHIPC
  3. Code https://vocab.ices.dk/services/rdf/collection/SHIPC/26D1

graybeal commented 3 years ago

I'm not sure why everyone is listing their IRI schemes—to illustrate that approaches diverge perhaps?

Note there is a distinction between the GUPRIs for the artefact as chosen by its authors—these are typically the namespaces and concept identifiers—and the identifiers chosen by repositories, which can be multiple (as Clement points out) and may build on the original identifiers.

I'm not sure what this item is recommending. There are many declarations of rationales that I take issue with, for example: "Using GUPRI for the different version allows information systems to retrieve automatically the latest version and older versions of the semantic artefact." I agree GUPRIs for versions allow retrieval of different versions of the same artefact; but not automatically, and not latest or older automatically, unless the GUPRI scheme is explicit, universal, and embeds semantics within it. (Which is held by many to be a bad practice.)

In a similar vein, I either disagree with or do not understand most of the second paragraph. The third paragraph lists many options (none of which I would support as a required standard) without prioritizing any. The obvious option of using persistent IRIs that are chosen to never change and that follow the namespace/identifiers of the original ontology is worth at least mentioning.

If I boil the ocean I get the following "should" statements, which I propose be listed explicitly in the final part of the recommendation for clear discussion.

  1. these identifiers should be resolvable and support the retrieval of both the semantic artefact itself and also its metadata
  2. these identification systems should apply to the semantic artefact but also to its content
  3. each element of the semantic artefact should also have an associated GUPRI
  4. a unified identifier schema should be used to identify each version of semantic artefact

Re 1, I don't know exactly what you are saying—simply that there should be a way to retrieve the metadata? Or that it should be "embedded" in the identifier in some way? (If I have a REST call to retrieve the metadata, that seems more than adequate.)

Re 2, does this just mean the same as 3? If not, what is the additional constraint that 2 represents?

Re 4, what does 'unified' mean in this context?