Change accessions array to ExternalIdentifier array

david4096 commented 8 years ago

External identifiers capture more details than an array of accessions. Consider using an array of ExternalIdentifier on references instead of an array of strings. The same might be said of the names array for variants.

mbaudis commented 8 years ago

URI? Or, better base URI (domain) and identifier.

gaberudy commented 8 years ago

For both of these contexts, the strings have some known structure. Accessions have the NCBI accession identifier structure, and variants may have various NCBI or Ensembl resource identifiers, but these have known and easily detected prefixes: rs1234 (dbSNP), RCV000076617.2 (ClinVar), COSM476 (COSMIC), etc.

I don't see the issue with the strings, which are ideal for searching and indexing.
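To make the prefix argument concrete, here is a minimal sketch of prefix-based detection for the three examples above. The pattern table and the `guess_source` name are hypothetical and only cover the identifiers quoted in this comment; it is not an authoritative list of identifier formats.

```python
import re

# Hypothetical pattern table: covers only the three examples from this thread.
PREFIX_PATTERNS = [
    ("dbSNP", re.compile(r"^rs\d+$")),
    ("ClinVar", re.compile(r"^RCV\d+(\.\d+)?$")),
    ("COSMIC", re.compile(r"^COSM\d+$")),
]


def guess_source(identifier):
    """Return the name of the first source whose pattern matches, else None."""
    for source, pattern in PREFIX_PATTERNS:
        if pattern.match(identifier):
            return source
    return None


for name in ("rs1234", "RCV000076617.2", "COSM476", "21659463"):
    print(name, guess_source(name))
```

A bare number such as 21659463 falls through to `None` here, which is the case where the source cannot be inferred from the string alone.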

diekhans commented 8 years ago

The contents of ExternalIdentifier are still strings. The point of the record is to add more context to the strings. In general, guessing the source of identifiers isn't a robust approach (e.g. entrez gene id).

The record probably needs revisiting at some point. The database field might be better served by a URI, and it's unclear to me whether splitting out the version is a good approach.

gaberudy commented 8 years ago

Good points. But for reference accessions, I'm not sure what you would use for the version field of ExternalIdentifier. They are by their nature already versioned identifiers. The version of NCBI at the time is irrelevant.

Since version is required, it would have to be well documented what that value would be in that case.

On the variant side, a version is usually available for a given database (dbSNP, ClinVar and COSMIC all have versions), but it is not always known. And again, the identifiers are either supposed to be version-free (RSIDs) or, in the case of ClinVar, are accessions with a built-in version component.

mbaudis commented 8 years ago

For the accessions array (which we haven't discussed for a while) I actually envisioned this as a "soft positive match" construct:

- GSM487790: would be a (pretty unique) single GEO experiment accession match
- 21659463: could be anything, but would be a direct PMID match
- orcid.org/0000-0002-9903-4248: would be a properly prefixed ORCID; 0000-0002-9903-4248 is just the "accession"

Semantically, externalIdentifiers would be appropriate for items like "GSM487790"; "orcid.org/0000-0002-9903-4248" would (IMO) be an accession (as a subset of externalIdentifiers).

But no matter how we name the attributes, we should provide the framework for both "context dependent" identifiers and fully resolving URIs and UUIDs to be represented.
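One possible reading of the "soft positive match" idea, as a rough sketch: a query hits a stored accession whether or not either side carries a resolver prefix such as `orcid.org/`. The helper names `bare_accession` and `soft_match` are made up for illustration, and treating everything after the last `/` as the accession is an assumption, not something the schema defines.

```python
def bare_accession(value):
    """Strip a resolver prefix such as 'orcid.org/' (assumed: text after the last '/')."""
    return value.rsplit("/", 1)[-1]


def soft_match(query, accessions):
    """Return stored accessions that equal the query once prefixes are ignored."""
    wanted = bare_accession(query)
    return [a for a in accessions if bare_accession(a) == wanted]


stored = ["GSM487790", "21659463", "orcid.org/0000-0002-9903-4248"]
print(soft_match("0000-0002-9903-4248", stored))
print(soft_match("GSM487790", stored))
```

This only compares strings. It says nothing about which database a hit belongs to, so a bare number like 21659463 would match regardless of whether it was meant as a PMID.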

diekhans commented 8 years ago

We need to ensure exact mapping of IDs to source. The structure of ExternalIdentifier needs to be revisited, but the concept seems sound.

database - this should be more strictly specified, probably as a URI
identifier - should include the version number when this is the convention
version - is defined as "version of the object or the database" and is required. Mixing these concepts is problematic. For instance, RefSeq has release versions, but the entry versions change between releases. The full release versions are not that useful: they manage the FTP site and can't easily be used for looking up specific entries.
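As a rough sketch of what the three points above could look like together (this is not the actual ExternalIdentifier schema definition): the field names come from this thread, but the class name, the optional `version`, the URI check, and the example URI are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class ExternalIdentifierSketch:
    """Illustrative only; not the real ExternalIdentifier record."""
    database: str                  # assumed to be an absolute URI naming the source
    identifier: str                # includes the entry version when that is the convention
    version: Optional[str] = None  # optional in this sketch; the current record requires it

    def __post_init__(self):
        parsed = urlparse(self.database)
        if not (parsed.scheme and parsed.netloc):
            raise ValueError(f"database must be an absolute URI: {self.database!r}")


# The URI below is an illustrative value, not one the schema prescribes.
clinvar_entry = ExternalIdentifierSketch(
    database="https://www.ncbi.nlm.nih.gov/clinvar",
    identifier="RCV000076617.2",
)
print(clinvar_entry)
```

Leaving `version` unset for identifiers that already carry their own version avoids having to document a placeholder value for that case.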

ga4gh / ga4gh-schemas

Change accessions array to ExternalIdentifier array #604