Consistent approach to identifiers/accessions and their associated versions

andrewyatz commented 4 years ago

Our current sequence feature proposal (#19) refers to identifiers (id); however, the example payloads describe these as a single field without a version. This issue is here to collate examples of identifiers in the wild and of how versioning is modelled in payloads or in other standards.

andrewyatz commented 4 years ago

Ensembl holds an identifier as a separate field from its version. This is persisted into our core database schemas, where stable ids are strings and versions are ints. Creating the ID.VERSION string requires domain knowledge: you have to know that the two can be concatenated with a "." separator.

This persists into our REST API representations where id and version are separate fields.
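As a minimal sketch of the domain knowledge involved, the following Python splits an ID.VERSION string into separate fields and joins them back. The names `StableId`, `parse_stable_id` and `format_stable_id` are hypothetical and are not part of any Ensembl API; the only assumptions taken from the description above are that the id is a string, the version is an int, and the two are joined with a ".".

```python
from typing import NamedTuple, Optional


class StableId(NamedTuple):
    """Hypothetical container: id and version held as separate fields."""
    id: str
    version: Optional[int]  # None when the source gave no version


def parse_stable_id(text: str) -> StableId:
    """Split 'ID.VERSION' on the last '.'; leave unversioned ids untouched."""
    base, sep, suffix = text.rpartition(".")
    if sep and base and suffix.isascii() and suffix.isdigit():
        return StableId(base, int(suffix))
    return StableId(text, None)


def format_stable_id(stable_id: StableId) -> str:
    """Join id and version with '.', omitting the version when it is unknown."""
    if stable_id.version is None:
        return stable_id.id
    return f"{stable_id.id}.{stable_id.version}"
```

Keeping `version` as `None` rather than defaulting it to a number means a consumer can tell "no version was recorded" apart from any real version.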

A version increment for transcripts has meant different things over time, but it now represents a change in the splicing or in the resulting sequence (cDNA, CDS, peptide). Those sequences can also be influenced by sequence edits, e.g. selenocysteine edits. A gene version increment means there is a change to the underlying transcript set. A protein version increments when there is a change in the peptide sequence. Note that the transcript logic changed after GRCh38, so we have identically versioned transcripts on GRCh37 and GRCh38.

Feedback from the clinical community has stated that having versioned stable ids is essential for clear tracking.

larrybabb commented 4 years ago

It comes down to being very precise about defining the identifier that represents a stable concept accurately.

Based on @andrewyatz's description above, the "transcript" concept defined by the Ensembl id alone does not include the specificity of representing "...a change in the splicing or resulting sequence (CDNA, CDS, peptide)." A version appended to the Ensembl id represents the more specific concept. This is a critical difference, and we should take care to distinguish the concept defined by the Ensembl "id" alone from the concept defined by the Ensembl "id+ver". Both can be relevant and useful identifiers: they denote two dependent, associated concepts that are nonetheless different and important to tell apart (as they seem to be).

Let's not get tied down by presuming that, just because there's an "id", it is the only concept and all versions of it are really the same thing. Each version (or "instance") of the id's general concept is dependent on that concept, but the versions should not be considered equal.

larrybabb commented 4 years ago

In some cases "versioning" can be changes to a single thing over time, or change events to a single concept. In other cases, it can be different forms of a similar thing, where the "un-versioned" concept is more of a generalization and does not contain any of the specificity associated with any one version.

larrybabb commented 1 year ago

@andrewyatz I'm not sure my comments above addressed the concern you raised. When you have time, could you please clarify if this issue is still a concern and what "we" need to do (and where we need to do it) so that we can align our perspectives?

andrewyatz commented 1 year ago

Hey @larrybabb. Yes, your comments did address the issue, but given the current state of VA I am unsure whether this is still a valid concern.

The concepts that IDs (with or without versions) represent need to be understood based on the provider, and our examples should represent the best-case scenario for a data point. Otherwise implementations may incorrectly assume what is meant by a field (ID in GFF3 is a perfect example of this). More details about a specific example are below.

Going back to #19 though there is a section which states:

exonNumber: "4/4",
gene: "APOE",
transcript: "ENST00000252486"

Stating ENST00000252486 is insufficient to recapture what exon 4 actually was, because the number of exons in a transcript and their splicing are permitted to change between versions. If the original data source only recorded ENST00000252486, there's not a lot we can do. But if the model examples omit version numbers where they are essential, then lossy statements will be produced. ENST00000252486 has been through 9 versions, the latest being a truncation in the 5' UTR. That would not be important to this specific statement, but it would be if the exonNumber were 1/4.
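To make the "lossy statement" point concrete, here is a rough sketch of a check a validator could run over such a payload. It assumes the payload is a plain dict using the field names from the #19 example, and that a version, when present, is appended to the transcript id with a "."; the function name `is_lossy_exon_statement` is hypothetical and not part of VA or any existing tool.

```python
def is_lossy_exon_statement(payload: dict) -> bool:
    """Return True when an exon-level claim cites a transcript without a version.

    Assumption: a versioned transcript looks like 'ID.VERSION' with an
    integer VERSION; anything else is treated as unversioned.
    """
    has_exon_claim = "exonNumber" in payload
    transcript = payload.get("transcript", "")
    base, sep, suffix = transcript.rpartition(".")
    is_versioned = bool(sep) and bool(base) and suffix.isascii() and suffix.isdigit()
    return has_exon_claim and not is_versioned


# The payload as written in #19: exon-level claim, no transcript version.
unversioned = {"exonNumber": "4/4", "gene": "APOE", "transcript": "ENST00000252486"}

# Same payload with an illustrative version appended (the value 9 is only an example).
versioned = {"exonNumber": "4/4", "gene": "APOE", "transcript": "ENST00000252486.9"}
```

With these inputs, `is_lossy_exon_statement(unversioned)` is true and `is_lossy_exon_statement(versioned)` is false. A check like this cannot recover a version the original source never recorded; it can only flag that the statement is ambiguous.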

ga4gh / va-spec

Consistent approach to identifiers/accessions and their associated versions #61