The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

Add short tandem repeat variants to sequence_alteration #359

Open thefferon opened 8 years ago

thefferon commented 8 years ago

sequence_alteration needs a term or terms to refer to STR alleles ('microsatellites') that differ in repeat unit number from the reference. Existing terms under sequence_alteration are not appropriate, in part because the last two terms below have no definitions:

Current SO:

Recommend introducing the following term at a level immediately beneath sequence_alteration:

If more granularity is needed, the following child terms could be added:

Note that, because STRs are highly variable, any example allele – including the allele represented by the reference – is properly considered as one variant along a spectrum. The SO terms should therefore accommodate the STR allele that happens to be represented by the reference as only one among many possible alleles.

thefferon commented 8 years ago

Regarding the addition of terms for expansion, contraction, etc.: I don't see the value in tracking and specifying when a STR allele is longer or shorter than the reference. The best way to understand STR variation at a particular locus, IMO, is to know how many different alleles there are and what are their relative frequencies in a population. The fact that a particular allele was found in the reference is arbitrary and largely irrelevant.
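To illustrate the population view described above, here is a minimal Python sketch that tallies how many distinct alleles occur at one STR locus and their relative frequencies. The function name `allele_frequencies` and the sample repeat counts are made up for illustration; they are not real data or part of any SO tooling.

```python
from collections import Counter


def allele_frequencies(repeat_counts):
    """Map each observed repeat-unit count to its relative frequency."""
    if not repeat_counts:
        return {}
    counts = Counter(repeat_counts)
    total = len(repeat_counts)
    return {n: c / total for n, c in sorted(counts.items())}


# Hypothetical sample: repeat-unit counts observed at a single STR locus.
observed = [8, 9, 9, 5, 9, 8, 10, 9]
freqs = allele_frequencies(observed)

for repeats, freq in freqs.items():
    print(f"{repeats} repeats: {freq:.3f}")
```

Nothing in this summary depends on which allele happens to be in the reference.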

thefferon commented 8 years ago

Need some guidance and/or discussion on this – 

There should be a term under sequence_alteration for variant alleles of a short tandem repeat feature. (For the purposes of this issue, I am using 'short tandem repeat' to be synonymous with, or inclusive of, microsatellite, dinucleotide repeat, trinucleotide repeat, short sequence repeat, mononucleotide repeat, etc.)

As explained above, there are currently two terms, located under sequence_alteration / substitution, that might fit the bill but do not currently have any definition:

SO:0000207 simple_sequence_length_variation (no definition)
SO:0000248 sequence_length_variation (no definition)

I propose, in order of preference, either:

  1. Introducing a new term directly under sequence_alteration called short_tandem_repeat_variant with an appropriate definition, OR
  2. Adding definitions to the existing two terms above, such that they accommodate STR variants.

Discussion?

keilbeck commented 8 years ago

Hi Tim

Sorry - this came thru while I was traveling, and I am still getting my bearings. This is one of those areas that has been difficult to deal with because these features appear in the reference and also as alterations.

I do like the way you are thinking about this. Can we schedule a call for next week to hammer out what needs to be done, get the definitions perfected etc?

Do you see the granular terms as effects or alterations?

--Karen

thefferon commented 8 years ago

Hi @karen, Yes, a call would be good. Please contact me with your availability.

> Do you see the granular terms as effects or alterations?

As alterations. For example, consider an STR for which the reference sequence has 8 TG repeats:

TGTGTGTGTGTGTGTG

If a variant were observed that had 9 repeats:

TGTGTGTGTGTGTGTGTG

I would want to identify it first and foremost as an alteration of the STR feature; secondarily it could be specified that the observed allele is longer than the reference, i.e. a short tandem repeat expansion. Similarly, the observation of an allele with only 5 repeats:

TGTGTGTGTG

would be considered a short tandem repeat contraction because it is an allele of the STR that happens to be shorter in length than the reference.
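The TG example above can be sketched in Python as follows. This is only an illustration: the helper names `count_repeats` and `classify_allele` and the label strings are hypothetical, not existing SO terms or IDs, and the code assumes each allele is a perfect repeat of the unit with no interruptions.

```python
def count_repeats(sequence, unit):
    """Return how many back-to-back copies of `unit` make up `sequence` exactly."""
    if not unit or len(sequence) % len(unit) != 0:
        raise ValueError("sequence length is not a multiple of the repeat unit")
    n = len(sequence) // len(unit)
    if unit * n != sequence:
        raise ValueError("sequence is not a perfect repeat of the unit")
    return n


def classify_allele(observed, reference, unit):
    """Label an observed STR allele by its length relative to the reference."""
    obs = count_repeats(observed, unit)
    ref = count_repeats(reference, unit)
    if obs > ref:
        return "short_tandem_repeat_expansion"    # hypothetical label
    if obs < ref:
        return "short_tandem_repeat_contraction"  # hypothetical label
    return "same_as_reference"                    # hypothetical label


reference = "TGTGTGTGTGTGTGTG"  # 8 TG repeats, as in the example
print(classify_allele("TGTGTGTGTGTGTGTGTG", reference, "TG"))  # 9 repeats
print(classify_allele("TGTGTGTGTG", reference, "TG"))          # 5 repeats
```

Either way, the allele is first of all an alteration of the STR feature; the expansion/contraction label is a secondary comparison against whatever the reference happens to carry.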

However, as I commented earlier, specifying whether the observed allele is shorter or longer than the reference only clouds one's understanding of the nature of STRs, and of the fact that the allele present in the reference is arbitrary. There may be 6 different alleles segregating in a population, and the reference can reflect only one of them. Other than the fact that it was observed in the sample that was used for the reference, it is nothing special compared to the other alleles. So, defining the other alleles as longer or shorter than the reference would be heading in the wrong direction.

There is of course a similar issue with the reference and SNVs. At many SNP loci, the reference sequence has what turns out to be the minor allele, when you consider an entire population. Yet we still have to deal with the fact that it IS in the reference, and it is what other alleles are compared to.

keilbeck commented 8 years ago

Here is what I suggest

  1. The two undefined terms need to be merged and moved under sequence_alteration, as they are not substitutions. Proposed definition: A kind of sequence alteration whereby the length of a feature is changed, usually by repeat expansion or contraction.
  2. Your new term to be a kind of sequence length variation: short tandem repeat variation. Proposed definition: A kind of sequence variant whereby a tandem repeat is expanded or contracted with regard to the reference.

Your granular subtypes should be child nodes. Do you need to annotate when the sequence is the same as reference? I'm just thinking about the implication for all other kinds of alteration.
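As a rough sketch of the hierarchy suggested here, the is_a relationships can be written as a simple child-to-parent map in Python. The term names are taken from this thread and written with underscores; the name of the merged term (`sequence_length_variation`) and the two child names are assumptions, not finalized SO terms.

```python
# Child -> parent (is_a) map for the proposed terms; names are illustrative.
PARENT = {
    "sequence_length_variation": "sequence_alteration",
    "short_tandem_repeat_variation": "sequence_length_variation",
    "short_tandem_repeat_expansion": "short_tandem_repeat_variation",
    "short_tandem_repeat_contraction": "short_tandem_repeat_variation",
}


def ancestors(term):
    """Walk is_a links upward and return the ancestors, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


print(ancestors("short_tandem_repeat_expansion"))
```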

Also I notice that miso is not pointing to the right place... --K

keilbeck commented 8 years ago

Regarding miso - look at the latest svn rather than the latest release.

If you annotate the cases where the allele is the same as the reference as short tandem repeat variation, with the effect set to no variation, would that work for your annotation? That way you are saying it's a repeat region, and that it is the same as reference. We then don't have to make an alteration term for every kind of feature that is the same as reference.