ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

Defining Genotypes in VRS - Feedback wanted. #347

Open larrybabb opened 2 years ago

larrybabb commented 2 years ago

I've been of the opinion for some time that VRS should provide the basic structures that let computational systems represent all variation types. As such, we run into the constant challenge of dealing with what certain terms mean biologically versus how we intend to use and define them computationally. The pace at which we are able to move has been painstakingly slow, which many of us would argue is for good reason. We are trying to develop a standard that engineers and computational biologists can clearly see supports their existing use cases, demonstrates a path forward that will not cause a major loss of investment, and doesn't have a high barrier to adoption.

We have evolved our variation model into 3 high-level classes: MolecularVariation, SystemicVariation and UtilityVariation. UtilityVariation thus far is mostly a catch-all for "everything else" for which we have yet to provide a more meaningful representation. This is where TextVariation and VariationSets live. MolecularVariation is the backbone of VRS in that it provides Allele and its more complex counterpart, Haplotype, which represent the state of sequence at a given location on a molecule. Allele is the state at a contiguous location and Haplotype is two or more Alleles in cis (on the same underlying sequence or molecule). SystemicVariation builds on MolecularVariation but does so with a foundation in the state of a "system". Thus far our system has primarily been the "genome", but there is discussion that this could be cast as other types of systems going forward. CNV is our first SystemicVariation and it represents the precise or ranged number of copies of a given MolecularVariation, Feature (aka Gene so far) or SequenceExpression.
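As a rough illustration of the classes described above, here is a minimal Python sketch. It is not the VRS schema or the vrs-python API: the field names (`sequence_id`, `start`, `end`, `state`, `min_copies`, `max_copies`) and the `CopyNumber` class name are assumptions made for the example, and a plain string stands in for a Feature or SequenceExpression.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Allele:
    """MolecularVariation: sequence state at one contiguous location."""
    sequence_id: str  # placeholder reference-sequence identifier (assumption)
    start: int        # start coordinate (coordinate convention is an assumption)
    end: int          # end coordinate
    state: str        # sequence observed at the location


@dataclass(frozen=True)
class Haplotype:
    """MolecularVariation: two or more Alleles in cis."""
    members: Tuple[Allele, ...]

    def __post_init__(self) -> None:
        if len(self.members) < 2:
            raise ValueError("a Haplotype needs two or more Alleles")
        if len({a.sequence_id for a in self.members}) != 1:
            raise ValueError("in-cis Alleles must share one sequence")


@dataclass(frozen=True)
class CopyNumber:
    """SystemicVariation: precise or ranged copy count of a subject."""
    # A str stands in for a Feature (gene) or SequenceExpression here.
    subject: Union[Allele, Haplotype, str]
    min_copies: int
    max_copies: int  # equal to min_copies for a precise count

    def __post_init__(self) -> None:
        if not 0 <= self.min_copies <= self.max_copies:
            raise ValueError("copy range must satisfy 0 <= min <= max")

    @property
    def is_precise(self) -> bool:
        return self.min_copies == self.max_copies
```

The sketch enforces "two or more Alleles" for Haplotype because that is how the paragraph above defines it; whether a one-Allele Haplotype should be allowed is a separate question.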

We now are working towards defining Genotype. My limited knowledge of genomics has led me to think of Genotype as the full state of all the instances of a given homologous region within the genome (or something like that). So a heterozygous or homozygous Allele would be a genotype in that you are stating you have either one or two copies of that Allele in total in the genome. My assumption was that these Alleles would always be in trans (or out of phase / unphased?).

It seems clear that we need to provide folks the ability to express the phase of the total copies of a given MolecularVariation, Feature or SequenceExpression. I assumed that Genotypes always implied that the regions being expressed were "in trans". I've recently been told that you can have two tandem copies of a region in cis and these would represent a genotype. We provide the ability to express a tandem duplication as an Allele with a repeating sequence expression. We also allow repeated sequence expressions (or tandem repeats/dupes) to be used in CNVs to provide a systemic quantification if desired.

We are proposing two new structures to provide the ability to represent the total systemic representation of homologous and/or co-occurring molecular variation, features and/or sequence expressions. Computationally, we are proposing a Genotype class to be the representation of molecular variation, features or sequence expressions that are known to be in trans. The Cooccurring class is provided to represent when two or more systemic, molecular, feature or sequence expressions occur together in the same system with no representation of whether these members are in cis or in trans.
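To make the distinction concrete, here is a minimal Python sketch of the two proposed structures. It is only an illustration under assumptions: members are represented by placeholder identifier strings such as `"allele:A"`, and the `copies` helper is hypothetical, not part of any proposal.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Genotype:
    """Hypothetical: members asserted to be in trans."""
    members: Tuple[str, ...]  # placeholder ids; a repeated id means two copies

    def copies(self) -> Counter:
        """Count how many times each member appears."""
        return Counter(self.members)


@dataclass(frozen=True)
class Cooccurring:
    """Hypothetical: members seen in the same system, phase not stated."""
    members: Tuple[str, ...]


# Placeholder identifiers, not real VRS ids.
heterozygous = Genotype(members=("allele:A", "allele:REF"))
homozygous = Genotype(members=("allele:A", "allele:A"))
phase_unknown = Cooccurring(members=("allele:A", "allele:B"))
```

Keeping the two as separate types means a Cooccurring never compares equal to a Genotype with the same members, so the presence or absence of the in-trans assertion is not lost.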

We believe that these two classes will allow computational developers to craft any representation of systemic variation they need with the precision or ambiguity they desire.

While the computational term Genotype might not align formally with the biological term, the question here is whether or not we should use a different term for our computational Genotype so that folks aren't confused by its requirement that all its members are in trans.

larrybabb commented 2 years ago

@rrfreimuth I'm tagging you for feedback since you educated me on the notion that genotypes do not have to be in trans. Please take a look at the rationale above for why we are defining a Genotype structure to mean "in trans". Consider how Cooccurring can be used for systemic representation of multiple molecular, systemic, feature and/or sequence expressions. I think we would use Cooccurring when we don't know if multiple members are in trans or in cis (even if they are homologous regions). We would use Genotype when they are "in trans" (potentially whether or not they are homologous regions). We would use Haplotype when they are "in cis" and that would be more about a Molecular variation rather than a systemic one.
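The rule of thumb in the comment above can be written as a small Python sketch. The `Phase` enum and the `choose_structure` function are hypothetical names introduced for illustration; they simply map what is known about phase to the structure proposed in this thread.

```python
from enum import Enum


class Phase(Enum):
    """What is known about the phase of the members (hypothetical enum)."""
    CIS = "cis"
    TRANS = "trans"
    UNKNOWN = "unknown"


def choose_structure(phase: Phase) -> str:
    """Return the name of the structure this thread proposes for a given phase."""
    if phase is Phase.CIS:
        return "Haplotype"    # molecular: members on the same molecule
    if phase is Phase.TRANS:
        return "Genotype"     # systemic: members known to be in trans
    return "Cooccurring"      # systemic: same system, phase not stated
```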

If we should rename Genotype to avoid confusion, please offer a new term.

larrybabb commented 2 years ago

To all ... please add use cases here for how you use Genotypes, Haplotypes and Alleles so that we can better validate the final version of Genotype in VRS 1.3

andreasprlic commented 2 years ago

One use case that comes to mind in this space is multi-nucleotide variants (MNVs): https://www.nature.com/articles/s41467-019-12438-5. They can occur due to a combination of distinct single-nucleotide mutation events, each of which has a distinct variant frequency. One way to represent these would be to make each single-nucleotide mutation an Allele and group the Alleles together as a Haplotype if both are found on the same chromosome in a sample.

Since the alleles can have different frequencies, I wonder how we would represent a sample where only one of the two alleles has been observed, and we are confident that the other position is like the reference. One option would be again a Haplotype here, which would only contain one allele, to distinguish the two scenarios.

I am sure we could come up with alternative representations, though; this is just a quick thought based on our discussion this morning. The gnomAD data file for MNVs is available for download here.

larrybabb commented 2 years ago

MNVs are something we haven't spent a ton of time talking about, but in the scenarios above you can represent the MNV as a Haplotype of two alleles if you choose. This may be helpful when trying to sort through members of Haplotypes to find matching Alleles quickly. But you could also create an Allele that is the MNV itself, which may make sense as well. In any case, I think we are responsible for providing some level of options for when computational folks want to do different things. We should not presume to know all the ways they will want to construct their alleles, haplotypes, etc...
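A small Python sketch of the two options mentioned above, using made-up coordinates and simplified `Allele` and `Haplotype` classes that are assumptions for illustration, not the VRS schema. The hypothetical `merge_adjacent` helper only handles member Alleles that are directly contiguous; MNVs whose SNVs are separated by an unchanged base would need the intervening reference sequence, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Allele:
    start: int  # start coordinate (interbase convention assumed)
    end: int    # end coordinate
    state: str  # sequence observed at the location


@dataclass(frozen=True)
class Haplotype:
    members: Tuple[Allele, ...]  # Alleles in cis


def merge_adjacent(haplotype: Haplotype) -> Allele:
    """Collapse directly contiguous member Alleles into a single Allele."""
    ordered = sorted(haplotype.members, key=lambda a: a.start)
    for left, right in zip(ordered, ordered[1:]):
        if left.end != right.start:
            raise ValueError("members are not contiguous")
    merged_state = "".join(a.state for a in ordered)
    return Allele(ordered[0].start, ordered[-1].end, merged_state)


# Two adjacent SNVs (made-up coordinates) seen on the same chromosome.
snv1 = Allele(start=100, end=101, state="T")
snv2 = Allele(start=101, end=102, state="G")

mnv_as_haplotype = Haplotype(members=(snv1, snv2))     # option 1
mnv_as_allele = Allele(start=100, end=102, state="TG")  # option 2
```

With option 1, checking whether a known SNV is part of the MNV is a membership test (`snv1 in mnv_as_haplotype.members`); with option 2 the MNV is a single object but the component SNVs have to be derived.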

I think one of the bigger items we are trying to address is whether we should allow a Haplotype to have only one Allele. And if we do, do we simply call it out as a case where the Haplotype of one Allele and that Allele itself can be construed as equivalent? If this is a significant worry, do we try to provide guidance on it? One alternative is to not allow Haplotypes of one Allele. Otherwise we acknowledge that this situation will occur and that it is acceptable to VRS.
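The two alternatives could look roughly like this in Python. Everything here is a hypothetical sketch: the simplified classes and the `reject_singleton`, `unwrap_singleton` and `equivalent` helpers are names invented for the example, not VRS behavior.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Allele:
    start: int
    end: int
    state: str


@dataclass(frozen=True)
class Haplotype:
    members: Tuple[Allele, ...]


Variation = Union[Allele, Haplotype]


def reject_singleton(haplotype: Haplotype) -> Haplotype:
    """Option 1: do not allow a Haplotype with fewer than two Alleles."""
    if len(haplotype.members) < 2:
        raise ValueError("a Haplotype must contain two or more Alleles")
    return haplotype


def unwrap_singleton(variation: Variation) -> Variation:
    """Option 2: allow it, but reduce a one-Allele Haplotype to that Allele."""
    if isinstance(variation, Haplotype) and len(variation.members) == 1:
        return variation.members[0]
    return variation


def equivalent(a: Variation, b: Variation) -> bool:
    """Compare two variations, treating a one-Allele Haplotype as its Allele."""
    return unwrap_singleton(a) == unwrap_singleton(b)
```

Option 2 keeps the one-Allele Haplotype available for cases like the one raised earlier in the thread, where it signals that the other position matched the reference, while still letting consumers opt in to treating it as equivalent to the bare Allele.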

github-actions[bot] commented 6 months ago

This issue was marked stale due to inactivity.