Open david4096 opened 8 years ago
URI? Or, better, a base URI (domain) and an identifier.
For both of these contexts, the strings have some known structure. Accessions have the NCBI accession identifier structure, and variants may have various NCBI or Ensembl resource identifiers, but these have known and easily detected prefixes: rs1234 (dbSNP), RCV000076617.2 (ClinVar), COSM476 (COSMIC), etc.
I don't see the issue with the strings, which are ideal for searching and indexing.
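To make the prefix argument concrete, here is a minimal sketch of prefix-based detection for the three examples above. The patterns and the `guess_source` helper are illustrative assumptions, not part of any schema, and the real identifier grammars are broader than these regular expressions.

```python
import re
from typing import Optional

# Illustrative patterns only, derived from the examples in this comment.
PREFIX_PATTERNS = [
    ("dbSNP", re.compile(r"^rs\d+$")),
    ("ClinVar", re.compile(r"^RCV\d+(\.\d+)?$")),
    ("COSMIC", re.compile(r"^COSM\d+$")),
]


def guess_source(identifier: str) -> Optional[str]:
    """Return the first source whose pattern matches, or None if nothing matches."""
    for source, pattern in PREFIX_PATTERNS:
        if pattern.match(identifier):
            return source
    return None


print(guess_source("rs1234"))          # dbSNP
print(guess_source("RCV000076617.2"))  # ClinVar
print(guess_source("COSM476"))         # COSMIC
print(guess_source("1234"))            # None: a bare number carries no prefix
```

The last call shows the limit of this approach: an identifier without a distinctive prefix cannot be attributed to a source from the string alone.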
The contents of ExternalIdentifier are still strings. The point of the record is to add more context to the strings. In general, guessing the source of identifiers isn't a robust approach (e.g. entrez gene id).
The record probably needs revisiting at some point. The database field might be better served by a URI, and it's unclear to me whether splitting out the version is a good approach.
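As a rough sketch of what "adding context to the string" could look like, the record below pairs the bare identifier with a base URI. The field names `database`, `identifier`, and `version` are assumptions modelled on this discussion, `version` is made optional here purely for illustration, and the `example.org` URIs are placeholders rather than real resolvers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalIdentifier:
    """Hypothetical record: an identifier plus the context needed to interpret it."""
    database: str                  # assumed here to be a base URI rather than a name
    identifier: str                # the bare string, still searchable and indexable
    version: Optional[str] = None  # optional in this sketch

    def uri(self) -> str:
        """Join the base URI and the identifier into a single URI string."""
        return f"{self.database.rstrip('/')}/{self.identifier}"


# The same bare string under two placeholder sources stays distinguishable.
gene = ExternalIdentifier("https://example.org/gene", "1234")
article = ExternalIdentifier("https://example.org/article", "1234")

print(gene.uri())       # https://example.org/gene/1234
print(gene == article)  # False
```

The `identifier` field remains a plain string, so searching and indexing work as before; the record only removes the need to guess where the string came from.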
Good points. But for reference accessions, I'm not sure what you would use for the version field of ExternalIdentifier. They are by their nature already versioned identifiers. The version of NCBI at the time is irrelevant.
Since version is required, it would have to be well documented what that value would be in that case.
On the variant side, there is usually a version that may be available for a given database (dbSNP, ClinVar, COSMIC have versions), but that is not always known, and again, the identifiers are supposed to be version free (for RSID), or in the case of ClinVar, they are accessions with a version component built-in.
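One way to see the problem with a separate, required version field: for accession-style identifiers the version is already the numeric suffix after the last dot, while version-free identifiers have nothing to put there. The `split_version` helper below is a hypothetical sketch of that distinction, not an existing API.

```python
from typing import Optional, Tuple


def split_version(identifier: str) -> Tuple[str, Optional[str]]:
    """Split 'ACCESSION.N' into (accession, version); return (identifier, None) otherwise."""
    base, dot, suffix = identifier.rpartition(".")
    # Only treat the suffix as a version when it is purely numeric.
    if dot and base and suffix.isdigit():
        return base, suffix
    return identifier, None


print(split_version("RCV000076617.2"))  # ('RCV000076617', '2')
print(split_version("rs1234"))          # ('rs1234', None)
```

In the first case a version field would only repeat what the identifier already says; in the second there is no identifier-level version at all, so any required value would have to come from somewhere else, such as a database release.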
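For the accessions array (which we haven't discussed for a while) I actually envisioned this as a "soft positive match" construct:
• GSM487790: would be a (pretty unique) single GEO experiment accession match
• 21659463: could be anything, but would be a direct PMID match
• orcid.org/0000-0002-9903-4248: would be a properly prefixed ORCID; 0000-0002-9903-4248 is just the "accession"
Semantically, externalIdentifiers would be appropriate for items like "GSM487790"; "orcid.org/0000-0002-9903-4248" would (IMO) be an accession (as a subset of externalIdentifiers).
But no matter how we name the attributes, we should provide the framework for both "context dependent" identifiers and fully resolving URIs and UUIDs to be represented.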
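A "soft positive match" could be sketched as a function that returns every plausible source for an accession string instead of committing to one. The patterns and the `candidate_sources` name below are assumptions for illustration, based only on the examples given in this thread.

```python
import re
from typing import List

# Illustrative patterns; real accession formats may be broader.
GEO_RE = re.compile(r"^GSM\d+$")
ORCID_RE = re.compile(r"^(?:orcid\.org/)?\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def candidate_sources(accession: str) -> List[str]:
    """Return all sources the accession could plausibly belong to."""
    candidates = []
    if GEO_RE.match(accession):
        candidates.append("GEO")
    if accession.isdigit():
        # A bare number could be anything; PubMed is only one candidate.
        candidates.append("PubMed")
    if ORCID_RE.match(accession):
        candidates.append("ORCID")
    return candidates


print(candidate_sources("GSM487790"))                      # ['GEO']
print(candidate_sources("21659463"))                       # ['PubMed']
print(candidate_sources("orcid.org/0000-0002-9903-4248"))  # ['ORCID']
print(candidate_sources("0000-0002-9903-4248"))            # ['ORCID']
```

Returning a list keeps the match "soft": a caller can query each candidate source, and a fully prefixed or URI form simply narrows the list.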
We need to ensure exact mapping of IDs to their source. The structure of ExternalIdentifier needs to be revisited, but the concept seems sound.
Michael Baudis notifications@github.com writes:
For the accessions array (which we haven't discussed for a while) I actually envisioned this as a "soft positive match" construct:
• GSM487790: would be a (pretty unique) single GEO experiment accession match
• 21659463: could be anything, but would be a direct PMID match
• orcid.org/0000-0002-9903-4248: would be a properly prefixed ORCID; 0000-0002-9903-4248 is just the "accession"
Semantically, externalIdentifiers would be appropriate for items like "GSM487790"; "orcid.org/0000-0002-9903-4248" would (IMO) be an accession (as a subset of externalIdentifiers).
But no matter how we name the attributes, we should provide the framework for both "context dependent" identifiers and fully resolving URIs and UUIDs to be represented.
External identifiers capture more details than an array of accessions. Consider using an array of ExternalIdentifier on references as opposed to an array of strings for references. The same might be said of the names array for variants.