ga4gh / vrs-python

GA4GH Variation Representation Python Implementation
https://github.com/ga4gh/vrs
Apache License 2.0

How to handle Allele normalization for Range Locations #237

Open ahwagner opened 11 months ago

ahwagner commented 11 months ago

One major theme raised in #234 is the question of "how do we handle Allele normalization when the Allele Location is specified by Ranges"? To me, these have always seemed to be a shorthand for "I did a targeted region assay and want to craft general statements about copy number in those regions and the potential broader impact they have". I know we allow people to create Alleles with Range-based Locations anyway, but... why? The PR supports those cases and raises interesting questions, e.g. what do we do with definite range intervals?

larrybabb commented 11 months ago

All great points. After looking at this for 30 minutes and thinking about it on a Sunday night, I tend to fall on the following side of things...

  1. We should not allow ranges as endpoints in any alleles.
  2. We should allow range endpoints in copy numbers (only at this point).
  3. You cannot normalize a location with one or both endpoints as ranges (definite or indefinite).

I think these range endpoints are only needed for microarray calls (unless someone can educate me otherwise). I believe these microarray calls really only produce representations of deleted or duplicated regions (oftentimes with ambiguous endpoints). I think we will be treating these as copy number variants (CopyNumberChange Variants), which in a way will help reduce the confusion about what these types of variants are and where they belong.

Again, I'm no expert in all the places where these types of ambiguous variant calls come from, but I would say that calling them alleles is not exactly aligned with our computational definition. As we have noted many times, any "deletion" could be considered a molecular variant and thus an Allele, but it is also a copy number (system) loss. Let's discuss further, but that's my Sunday night feedback for what it's worth.
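
To make the three rules above concrete, here is a minimal sketch in plain Python. It is not vrs-python's API: the `Coordinate` alias, `is_range`, `can_normalize`, and `check_endpoints` are hypothetical names, and a Range is assumed to be a two-item list whose bounds may be `None` (indefinite).

```python
from typing import Optional, Sequence, Union

# Hypothetical stand-in for a VRS coordinate: an int is a precise position,
# a two-item sequence is a Range whose bounds may be None (indefinite).
Coordinate = Union[int, Sequence[Optional[int]]]


def is_range(coord: Coordinate) -> bool:
    """Return True when the coordinate is a Range rather than a single integer."""
    return not isinstance(coord, int)


def can_normalize(start: Coordinate, end: Coordinate) -> bool:
    """Rule 3: a location with one or both endpoints as Ranges is not normalizable."""
    return not (is_range(start) or is_range(end))


def check_endpoints(variation_type: str, start: Coordinate, end: Coordinate) -> None:
    """Rules 1 and 2: reject Range endpoints on Alleles, allow them elsewhere."""
    if variation_type == "Allele" and not can_normalize(start, end):
        raise ValueError("Range endpoints are not allowed on an Allele location")
```

Under this sketch, `check_endpoints("CopyNumberChange", [None, 100], [200, None])` passes, while the same endpoints with `"Allele"` raise `ValueError`.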

larrybabb commented 10 months ago

@ahwagner is it possible that indefinite or definite range endpoints should be treated as either one of the forthcoming SV breakend or breakpoint classes? I'm still not sure I have my head around the breakend concept completely, but it sure feels like the indefinite ranges are similar to a breakend.

Please educate me on why this is a nonsensical idea.

ahwagner commented 10 months ago

@larrybabb regarding https://github.com/ga4gh/vrs-python/issues/237#issuecomment-1769258749, I think that indefinite range data structures (and the SequenceLocation objects that use them) are compatible with breakend representation. I'm going to be reviewing and commenting on some of the outstanding SV-VRS issues later this week and will come back to this, but wanted to move the discussion about your recent comment over to ga4gh/vrs#365, where this same solution was proposed by @cmprocknow.

github-actions[bot] commented 7 months ago

This issue was marked stale due to inactivity.

larrybabb commented 4 months ago

@ahwagner Where do we stand on this? Are we fully supporting the notion that the start and end positions on a SequenceLocation can each be either an integer or a Range? This is fairly critical if we plan to treat all Range-based positions in hgvs expressions as Adjacency types (or Breakends). I'd like to know if we should settle on a firm direction before we go much further. We are about to implement Range in vrs-python for the allele and cnv translators for hgvs expressions like NC_000006.12:g.(?_57046622)_(57088889_?)del.

It seems like we may just presume that any hgvs expression that has a Range endpoint is really a structural variant of type del or dup that can be represented as an Adjacency. Please clarify your perspective here.
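
For reference, here is a rough sketch of how an expression like the one above might map onto Range start and end values. This is not the vrs-python translator: `parse_range_hgvs` and the returned dict layout are hypothetical. It assumes that `?` maps to an indefinite bound (`None`) and that the usual 1-based HGVS to interbase conversion applies (start bounds minus one, end bounds unchanged).

```python
import re
from typing import Optional

# Matches only the narrow form ACC:g.(a_b)_(c_d)del|dup, where each bound
# is a number or "?". Anything else is rejected.
_PATTERN = re.compile(
    r"^(?P<ac>[^:]+):g\."
    r"\((?P<s_lo>\d+|\?)_(?P<s_hi>\d+|\?)\)_"
    r"\((?P<e_lo>\d+|\?)_(?P<e_hi>\d+|\?)\)"
    r"(?P<kind>del|dup)$"
)


def _bound(token: str, offset: int = 0) -> Optional[int]:
    """Convert one HGVS bound to an int (plus offset), or None for '?'."""
    return None if token == "?" else int(token) + offset


def parse_range_hgvs(expr: str) -> dict:
    """Parse a range-endpoint del/dup expression into Range-style start/end."""
    m = _PATTERN.match(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr}")
    return {
        "accession": m["ac"],
        "kind": m["kind"],
        # Assumption: HGVS is 1-based inclusive, so interbase start = pos - 1.
        "start": [_bound(m["s_lo"], -1), _bound(m["s_hi"], -1)],
        # Assumption: interbase end equals the 1-based end position.
        "end": [_bound(m["e_lo"]), _bound(m["e_hi"])],
    }
```

With these assumptions, the example expression yields `start = [None, 57046621]` and `end = [57088889, None]`. Whether that pair then belongs on an Allele, a CopyNumberChange, or an Adjacency is exactly the open question here.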

ehclark commented 4 months ago

I will make a comment here without fully understanding all the details, but I do think it is relevant.

For CNVs specifically, our current filtration/annotation process uses bedtools intersect. The CNV calls are coming from DRAGEN. The CNV databases we are using include ClinVar, ClinGen, Decipher, GeneDx, Manta, and gnomAD. The typical requirement is a 50% reciprocal overlap between the patient/subject calls and the database.

In the future, when we adopt VRS IDs for CNVs, I think we will want to be able to do the equivalent of bedtools intersect using the VRS objects. It looks to me like either Range or Adjacency would support this computation. Although for Adjacency, if one or both of the adjoinedSequences were IRIs, it could get complicated?
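
To illustrate the kind of computation I mean, here is a small sketch of a 50% reciprocal overlap test on interbase intervals (roughly what `bedtools intersect -f 0.5 -r` checks). The function names are hypothetical and nothing here is part of vrs-python. Collapsing a Range endpoint to its inner bound (the confirmed region) is just one possible choice, made here as an assumption.

```python
from typing import Optional, Sequence, Tuple, Union

Coordinate = Union[int, Sequence[Optional[int]]]


def definite_span(start: Coordinate, end: Coordinate) -> Tuple[int, int]:
    """Collapse Range endpoints to the inner (confirmed) span.

    Assumption: for a start Range the inner bound is the upper value, and
    for an end Range it is the lower value.
    """
    s = start if isinstance(start, int) else start[1]
    e = end if isinstance(end, int) else end[0]
    if s is None or e is None:
        raise ValueError("inner bound is indefinite; cannot derive a span")
    return s, e


def reciprocal_overlap(
    a: Tuple[int, int], b: Tuple[int, int], min_fraction: float = 0.5
) -> bool:
    """True when the overlap covers at least min_fraction of both intervals."""
    a_len, b_len = a[1] - a[0], b[1] - b[0]
    if a_len <= 0 or b_len <= 0:
        return False
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap >= min_fraction * a_len and overlap >= min_fraction * b_len
```

For example, `reciprocal_overlap((0, 100), (50, 150))` passes at the default threshold, while `reciprocal_overlap((0, 100), (0, 1000))` does not, because the overlap covers only 10% of the second interval.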