ga4gh / pedigree

Repository for the family history/pedigree project
https://pedigree.readthedocs.io/

Identifiers #8

Closed mbaudis closed 3 years ago

mbaudis commented 3 years ago

Identifiers are proposed as "list of strings"; however, such a definition is not helpful for persistent or cross-system ids, and does not allow an easy understanding of the scope/target of those identifiers.

In several other schemas we've come to define identifiers in general as CURIEs, wrapped as objects with optional description and/or label attributes. Examples:

Phenopackets

The suggestion is to stick to this general structure & provide identifiers as objects, with documented preference for CURIEs (which may be rare for current use for individuals, but nevertheless seems "forward looking").

buske commented 3 years ago

Thanks for the feedback, @mbaudis! I agree that CURIEs are definitely the right call for communication of Concepts. I'll expand the description of the Concept to add those optional attributes.

For the identifiers field, though, it isn't clear to me what the namespace would be in the vast majority of use cases: medical record numbers and research participant codes. These aren't public identifiers, but rather private ids or URIs for FHIR resources. Thoughts?

mbaudis commented 3 years ago

@buske We basically treat them in the same format; all identifiers that are not primarily attached to an internal object (e.g. for us biosample.individual_id => individual.id) are treated in the same way, with the default {id: "__prefix__[:-]__local__", label: "__label__" } format, where an internal prefix is separated by "-" and the label is optional (also: description as an optional attribute).

This is nice since it lets us follow the same format definitions and attach/use additional information. For example, we use icdom-85003 (private version of a non-CURIE'd code) but NCIT:3262 (this is also the outcome of some GA4GH discussion).

IMO, regardless of whether an id is private or a CURIE, the option to assign some descriptive value to an id is very much needed. And there is no real overhead between "__idvalue__" and {id: "__idvalue__"}.
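
A minimal Python sketch of the object format described above, for illustration only. The class name `Identifier`, the helper `from_raw`, and the regular expression are assumptions made here and are not part of the pedigree schema or Phenopackets; the sketch only shows that a bare string and an `{id: ...}` object can be handled through the same structure.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Assumed pattern: "<prefix><':' or '-'><local>", split at the first separator.
ID_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z0-9_.]+)[:-](?P<local>.+)$")


@dataclass(frozen=True)
class Identifier:
    """Hypothetical identifier object: required id, optional label/description."""
    id: str
    label: Optional[str] = None
    description: Optional[str] = None

    def split(self) -> Tuple[Optional[str], str]:
        """Return (prefix, local), or (None, id) if the id has no prefix."""
        match = ID_PATTERN.match(self.id)
        if match is None:
            return None, self.id
        return match.group("prefix"), match.group("local")


def from_raw(value: Union[str, dict]) -> Identifier:
    """Normalise a bare string or an {id, label, description} dict."""
    if isinstance(value, str):
        return Identifier(id=value)
    return Identifier(**value)


# The ids below are the ones quoted in this thread; the label is a placeholder.
curie = from_raw({"id": "NCIT:3262", "label": "__label__"})
private = from_raw("icdom-85003")
unprefixed = from_raw("123_4567")
```

With these assumptions, `curie.split()` and `private.split()` both yield a prefix and a local part, while `unprefixed.split()` yields no prefix, so ids without a namespace still fit the same object.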

buske commented 3 years ago

@mbaudis I'm still struggling to understand what prefix should or could be used in most cases. If the ID is an internal project identifier: 123_4567, what prefix makes sense? If it's a medical record number for the patient's EHR: 12345, what should the prefix be? I think the conceptual problem is that I don't see a way to practically define these prefixes as part of the software ecosystem in a way that guarantees that the prefix:local is unique across entities, so the prefix doesn't add anything and just confounds things.
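
The uniqueness concern can be illustrated with a small Python sketch. The record layout, the source names, and the "mrn" prefix are hypothetical and chosen only for this example; nothing here reflects an actual registry of prefixes.

```python
from collections import defaultdict

# Hypothetical records: two unrelated systems independently chose the prefix "mrn".
records = [
    {"source": "hospital_a", "id": "mrn:12345"},
    {"source": "hospital_b", "id": "mrn:12345"},
    {"source": "project_x", "id": "123_4567"},
]


def find_collisions(records):
    """Map each id used by more than one source to the sorted list of sources."""
    sources_by_id = defaultdict(set)
    for record in records:
        sources_by_id[record["id"]].add(record["source"])
    return {
        identifier: sorted(sources)
        for identifier, sources in sources_by_id.items()
        if len(sources) > 1
    }


collisions = find_collisions(records)
```

In this sketch the prefixed id is shared by two different sources, so the prefix alone does not make `prefix:local` unique unless something outside the schema coordinates who may use which prefix.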

I agree wholeheartedly that it should be an object and not a raw string. :)

buske commented 3 years ago

Resolved with the latest changes