'Identifier' specification in data model

data2health / contributor-attribution-model

A simple data model to represent contributions made by agents to research artifacts


'Identifier' specification in data model #7

Open mbrush opened 5 years ago

mbrush commented 5 years ago

At present the spec requires identifiers to be specified as CURIEs in string form (see the Information Model documentation here). But there are many other ways we might specify or constrain identifier representation, e.g.:

Do not require the id to be a CURIE: allow any string here, but recommend using a CURIE with a namespace and a reference component. This is the simplest and least constraining option. But if we want to support creation of RDF/linked data, then CURIEs/URIs are a requirement.
Require representing CURIEs as structured objects, perhaps adopting the FHIR Identifier data type for this.
Allow both id and Identifier data types (where 'id' would allow for local identifiers without associated namespaces), again following FHIR's lead here.
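To make the options concrete, here is a minimal Python sketch of how a CURIE string and a structured identifier might relate. It is only an illustration: the `Identifier` class, `from_curie`, and `to_curie` are hypothetical names, and the class borrows just the `system` and `value` fields from the FHIR Identifier data type (FHIR uses a URI for `system`; a bare prefix is used here to keep the sketch short).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    """Hypothetical structured identifier, loosely modeled on FHIR's system + value."""
    value: str                    # the reference component, e.g. "12345"
    system: Optional[str] = None  # the namespace/prefix; None marks a local id


def from_curie(curie: str) -> Identifier:
    """Split a 'prefix:reference' string into a structured Identifier."""
    prefix, sep, reference = curie.partition(":")
    if not sep or not prefix or not reference:
        raise ValueError(f"not a CURIE: {curie!r}")
    return Identifier(value=reference, system=prefix)


def to_curie(identifier: Identifier) -> str:
    """Serialize a namespaced Identifier back to CURIE string form."""
    if identifier.system is None:
        raise ValueError("a local id has no namespace and cannot be written as a CURIE")
    return f"{identifier.system}:{identifier.value}"


orcid = from_curie("orcid:0000-0002-1048-5019")  # namespaced identifier
local = Identifier(value="12345")                # local id, no namespace
```

Under the first option only the plain string would be stored; under the second only the structured form; the third would accept both `orcid` and `local` above.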

We should discuss the most practical approach, given the technical context in which the model will be implemented.

mbrush commented 5 years ago

Related is the issue of how to capture internal vs external identifiers for entities such as Person - which may have a local/internal identifier in a database, but also an external id from an authority like ORCID.

Currently the spec advises the following:

Instance identifiers can be internally generated by the system creating the data, or borrowed from external systems/authorities that mint identifiers (e.g. databases, registries, ontologies). For example, a Person instance can be assigned an internal identifier by the system generating the record (e.g. ex:12345), or the system can choose to use an established ORCID for that person (e.g. orcid:0000-0002-1048-5019). If a system wishes to capture both internal and external identifiers, the internal id should be captured in the 'identifier' attribute, and the 'externalId' attribute can be used to record one or more established ids from community authorities.
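As a rough illustration of that guidance, a Person record carrying both kinds of identifier might look like the sketch below. The `identifier` and `externalId` attribute names and the two example ids come from the spec text above; the `Person` dataclass and the `all_identifiers` helper are hypothetical and not part of the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    identifier: str                                       # internal id minted by the local system
    externalId: List[str] = field(default_factory=list)  # ids from community authorities


def all_identifiers(person: Person) -> List[str]:
    """Return the internal id followed by any external ids (hypothetical helper)."""
    return [person.identifier, *person.externalId]


person = Person(
    identifier="ex:12345",
    externalId=["orcid:0000-0002-1048-5019"],
)
```

A system that only has an ORCID could instead put it directly in `identifier` and leave `externalId` empty.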

We should review this approach and revise as needed.

diatomsRcool commented 5 years ago

I like ORCID, but we may need something for dead people (ISNI?) - if that is in scope.

mellybelly commented 5 years ago

I think that for the artifacts, we need to be much more flexible about the identifier type. CURIEs will not likely be common, and persistence is the more important recommendation. We should provide examples of many different types of useful identifiers. We can require global uniqueness, and resolution as a best practice.

mellybelly commented 5 years ago

@jmcmurry please comment

jmcmurry commented 5 years ago

Everything must be resolvable in some way. If necessary, we can use W3ID for the oddball edge cases.

Do you anticipate having a LOT of different prefixes for this work? Do you anticipate needing to programmatically round trip CURIE->URL->CURIE?

If the answer to either of these is "no", then sticking with CURIEs is cleaner.

Either way, yes -- make recommendations for classes of things and what corresponding identifiers to provide. For people, ORCID >> whatever ORCID doesn't offer that we need to get the job done.

There are tradeoffs, so either way it would be good to have scripts that can examine the identifier health of the corpus. *sigh* This old chestnut is going to be the death of me :)
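A minimal sketch of what such a script could check, assuming a small prefix map: it round-trips each identifier CURIE->URL->CURIE and reports the ones that fail. The function names and the `ex` entry are hypothetical; only the `https://orcid.org/` base is a real resolver URL. It does not test whether the URLs actually resolve over the network.

```python
# Assumed prefix map; a real project would maintain or import its own.
PREFIX_MAP = {
    "orcid": "https://orcid.org/",
    "ex": "https://example.org/",  # placeholder namespace for internal ids
}


def expand(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """CURIE -> URL. Raises KeyError if there is no known prefix."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise KeyError(f"unknown or missing prefix in {curie!r}")
    return prefix_map[prefix] + reference


def contract(url: str, prefix_map: dict = PREFIX_MAP) -> str:
    """URL -> CURIE, preferring the longest matching base URL."""
    matches = [(base, prefix) for prefix, base in prefix_map.items() if url.startswith(base)]
    if not matches:
        raise KeyError(f"no prefix registered for {url!r}")
    base, prefix = max(matches, key=lambda match: len(match[0]))
    return f"{prefix}:{url[len(base):]}"


def failing_identifiers(corpus: list) -> list:
    """Return the identifiers that do not survive CURIE -> URL -> CURIE."""
    bad = []
    for curie in corpus:
        try:
            if contract(expand(curie)) != curie:
                bad.append(curie)
        except KeyError:
            bad.append(curie)
    return bad
```

For example, `failing_identifiers(["ex:12345", "local-7"])` would flag `local-7`, since it has no prefix to expand.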

mbrush commented 5 years ago

I updated the spec to pull back on the CURIE requirement. It now informally requires global uniqueness, with resolution as a best practice, and recommends use of CURIEs as one way to achieve this.

Also, the informal Implementation Guidance in the spec makes recommendations for specific types of identifiers for certain entity types (e.g. ORCID for persons), but the spec does not formally request or require this.

@jmcmurry can you review the text in the spec about the identifier data type and see if you concur. Feel free to comment on or edit the text (in suggest mode). And as for your questions above, I am not sure yet about how many prefixes will be used in this work, or programmatic round-tripping requirements. Thanks!