ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

CNV, STRs, somatic Var Rep Group concept needed? #39

Closed larrybabb closed 5 years ago

larrybabb commented 6 years ago

CNVs, microsatellites and a variety of somatic variant representations have given rise to the notion of defining a variant grouping: a set of variant instances (not necessarily equivalent) which can be used for annotations, assertions, interpretations, evidence collection, etc.

In our modeling to date we have intentionally been focusing on the most atomic representations, rightfully so. However, with the advent of the copy number discussion, we have introduced the notion of providing a range for the quantity of copies for copy number gain variants.

All previous examples (afaik) have focused on defining a very specific instance of a variant (i.e. allele, haplotype, genotype). We sort of got into this realm of a "set" or "group" of instances when discussing PGx haplotypes as defined by CPIC/PharmGKB, but we never really resolved the concern.

Question... To focus on CNVs and microsatellites for now, what does it mean to specify a range of copy numbers (i.e. from 5 to 20 -or- more than 47)?

Possible answer... A CNV instance is a specific number of copies of a given region of a chromosome. The region of the chromosome, with its non-negative number of copies, is the instance of the sequence. So, to specify a "range" of copies is essentially saying that any one of the "instances" in this range belongs in this group.

For example, if you wanted to specify that a given interpretation is valid for any copy number between 4 and 10 of region 1000 to 2000 on chromosome 1, then you are saying that any specific copy instance between 4 copies and 10 copies would be covered by that interpretation.

Interpretation 1... Variant Group: NC_000001.10:1000..2000 (4 to 10 copies) Pathogenicity: Uncertain Significance

Interpretation 2... Variant Group: NC_000001.10:1000..2000 (>10 copies) Condition: Condition X Pathogenicity: Pathogenic

Case 1 specific finding... Variant found: NC_000001.10:1000..2000 (6 copies) Result: interp 1 above matches and the assertion may potentially be used to inform the patient's results.

Case 2 specific finding... Variant found: NC_000001.10:1000..2000 (20 copies) Result: interp 2 above matches and the assertion may potentially be used to inform the patient's results.

Hopefully, this highlights the distinction between defining "variants" that are "sets" or "groups" versus "instances", and the need to be able to do both in order to collect knowledge and associate it with actual findings.
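To make the group-versus-instance distinction concrete, here is a minimal Python sketch of the copy number example. The class names `CopyNumberInstance` and `CopyNumberGroup` and their fields are hypothetical and not part of any proposed schema. It also assumes that an instance must cover exactly the same region as the group (1000..2000 for both findings) and that ">10 copies" means 11 or more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CopyNumberInstance:
    """One specific finding: an exact copy count for a region (hypothetical model)."""
    sequence: str
    start: int
    end: int
    copies: int


@dataclass(frozen=True)
class CopyNumberGroup:
    """A set of instances: the same region with a range of copy counts."""
    sequence: str
    start: int
    end: int
    min_copies: Optional[int] = None  # inclusive lower bound; None means unbounded
    max_copies: Optional[int] = None  # inclusive upper bound; None means unbounded

    def contains(self, instance: CopyNumberInstance) -> bool:
        # Assumption: regions must match exactly; overlap rules are out of scope here.
        same_region = (instance.sequence, instance.start, instance.end) == (
            self.sequence, self.start, self.end)
        if not same_region:
            return False
        if self.min_copies is not None and instance.copies < self.min_copies:
            return False
        if self.max_copies is not None and instance.copies > self.max_copies:
            return False
        return True


# The two interpretations are attached to groups, not to instances.
interp_1 = CopyNumberGroup("NC_000001.10", 1000, 2000, min_copies=4, max_copies=10)
interp_2 = CopyNumberGroup("NC_000001.10", 1000, 2000, min_copies=11)

# The two case findings are instances.
case_1 = CopyNumberInstance("NC_000001.10", 1000, 2000, copies=6)
case_2 = CopyNumberInstance("NC_000001.10", 1000, 2000, copies=20)
```

With these definitions, `interp_1.contains(case_1)` and `interp_2.contains(case_2)` are both true, while the cross checks are false, which mirrors the matching described in the two cases.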

This can also be applied to microsatellites, which are short tandem repeats that often get expressed as a range as well, as in the HTT gene for Huntington's disease. See ClinVar NM_002111.6(HTT):c.52CAG(27_35).

Individual assay findings produce a specific count of the tandem repeats, and one then determines whether they fall into the variant group defined by NM_002111.6(HTT):c.52CAG(27_35) or some other group that may have a different interpretation.
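A rough Python sketch of that check follows. It assumes the group expression always has the shape of the ClinVar name above (reference, `c.` position, repeat unit, then a `low_high` range in parentheses); the regular expression, the function names and the toy read are all made up for illustration and do not come from any library.

```python
import re

# Assumed shape: <reference>:c.<position><unit>(<low>_<high>)
REPEAT_GROUP = re.compile(
    r"^(?P<reference>[^:]+):c\.(?P<position>\d+)(?P<unit>[ACGT]+)"
    r"\((?P<low>\d+)_(?P<high>\d+)\)$"
)


def parse_repeat_group(expression):
    """Return (unit, low, high) from a repeat-range expression."""
    match = REPEAT_GROUP.match(expression)
    if match is None:
        raise ValueError(f"unrecognized repeat group: {expression!r}")
    return match["unit"], int(match["low"]), int(match["high"])


def longest_tandem_run(sequence, unit):
    """Count the repeats in the longest uninterrupted run of `unit`."""
    runs = re.finditer(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run.group(0)) // len(unit) for run in runs), default=0)


def in_repeat_group(expression, sequence):
    """True if the observed repeat count falls inside the group's range."""
    unit, low, high = parse_repeat_group(expression)
    return low <= longest_tandem_run(sequence, unit) <= high


group = "NM_002111.6(HTT):c.52CAG(27_35)"
toy_read = "AAG" + "CAG" * 30 + "TTC"  # toy sequence, not real HTT data
```

Here the assay result is an instance (30 repeats in the toy read) and the ClinVar name is the group; `in_repeat_group(group, toy_read)` asks whether the instance belongs to it.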

As we explore variant representations, let's determine if we need to be separating the notion of atomic, specific, instance representations from group or set representations and provide a clean separation, if so.

larrybabb commented 6 years ago

Also, bear in mind that while this "group" concept may seem similar to genotype (or haplotype), it is different in that haplotype and genotype represent a "complete" set of variants that must all co-occur. This concept is more of an "OR" than an "AND" of grouped variants.
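The OR-versus-AND difference can be sketched with plain Python sets. The variant labels (`var_A`, etc.) and function names are placeholders, and real matching would of course compare variant objects rather than strings.

```python
def matches_group(observed, members):
    """OR semantics: any one member of the group is enough."""
    return not observed.isdisjoint(members)


def matches_haplotype(observed, members):
    """AND semantics: every member must co-occur in the observation."""
    return members <= observed


members = {"var_A", "var_B", "var_C"}  # placeholder identifiers
one_hit = {"var_B"}
all_hits = {"var_A", "var_B", "var_C", "var_D"}
```

A sample carrying only `var_B` matches the group but not the haplotype; a sample carrying all three members matches both.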

mbaudis commented 6 years ago

@larrybabb I think this moves into the variant annotation area, by mixing the need for variant type representation (do we have a proper name for that?) with the need for variant instance representation.

Maybe we should just separate the way variant types / equivalencies are represented from the instance (== case-specific) representation, into really different approaches?

So we would have:

For instance:

larrybabb commented 6 years ago

Malachi Griffith added a summarization of distinct types of variants found in CiVIC in Var Anno repo issue 13.

larrybabb commented 6 years ago

@mbaudis we will set up a call for this discussion, as it may be too complex to fully separate all the concerns effectively in an issue thread.

But to respond to the three instances at the bottom of your comment above

I’m not sure I agree that the following 2 kinds of variants in bullet 1 are equivalent:
A. Deletion of one allele of a gene
B. Deletion of all alleles of a gene

The relationship between these two is a subset/superset relationship (I believe). In any case, the question I’m trying to answer is “How do we represent item B as a variant, when it appears to be a representation of a set of variants?”

The notion of a set or class of variants was recently spotlighted by Malachi on the Var Anno call as types of Var reps that would be needed to support the “subject” attribute of many of the somatic interp types.

I also see the similarity of this pattern in regards to using copy number ranges to define a set of cnvs which all share a common interp.

Finally I would say that I agree with your third bullet in regards to queries needing the ability to query Var class types and/or copy count quantities.

However, we haven’t yet demonstrated how these kinds of qualifying attributes will be bundled with the variant concepts needed to build objects that can support the role of Var Anno “subjects”.

mbaudis commented 6 years ago

@larrybabb I tried to put too much in the sentence (bullet 1); my note was on the "any deletion of one", and the "any deletion of all" as two different types of equivalence.

In imprecise CNV reports (i.e. w/o phasing), the homozygous deletion would "self compose":

#####___________#####
#######_______#######
222221100000001122222

Without full allelic reconstruction it would not be clear how the 0 comes about; it could also be

#####_________#######
#######_________#####
222221100000001122222

... and so would be reported as 3 different variants (11, 0000000, 11). So this is a case where we get some meaningful outcome description (yes, there is a homozygous deletion) w/o knowing about the specific alleles.
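The two diagrams can be checked with a short Python sketch. It treats each row as one allele, with `#` meaning present and `_` meaning deleted, sums the rows per position, and then splits the result into segments. The function names are illustrative only.

```python
from itertools import groupby


def copy_number_profile(*alleles):
    """Sum per-position presence ('#') across alleles into a digit string."""
    if len({len(allele) for allele in alleles}) != 1:
        raise ValueError("alleles must have equal length")
    return "".join(
        str(sum(symbol == "#" for symbol in column)) for column in zip(*alleles)
    )


def segments(profile):
    """Split a profile into (start, end, copies) runs; end is exclusive."""
    result, position = [], 0
    for value, run in groupby(profile):
        length = len(list(run))
        result.append((position, position + length, int(value)))
        position += length
    return result


# First phasing: a long deletion nested around a short one.
phasing_1 = copy_number_profile("#####___________#####", "#######_______#######")
# Second phasing: two staggered deletions of equal length.
phasing_2 = copy_number_profile("#####_________#######", "#######_________#####")

# Segments that differ from the normal two copies.
variants = [seg for seg in segments(phasing_1) if seg[2] != 2]
```

Both phasings yield the same profile, `222221100000001122222`, and the non-diploid segments come out as three separate calls (1 copy, 0 copies, 1 copy), so the unphased report cannot tell the two allele configurations apart.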

See example of array based data here.

But such a (widespread, simplistic) model does not cover the composition of multiple variants, just reports the outcome of this composition.

The problem is that we have to accommodate both; but maybe not necessarily in all scenarios. And maybe really thinking this through could help to reduce complexity for each of those implementations.

reece commented 5 years ago

Please use #46 for a consolidated discussion of CNV requirements.