RLE bug with large deletions

larrybabb commented 7 months ago

@ahwagner I can't get my head around this one. I believe this is a bug. please verify.

When I run the following hgvs expression through vrs-python's translate_from()...

NC_000016.9:g.89306725_89339913del
curl -X 'GET' \
  'https://normalize.cancervariants.org/variation/translate_from?variation=NC_000016.9%3Ag.89306725_89339913del&fmt=hgvs' \
  -H 'accept: application/json'

I get a RLE state with a length of 1, a repeatSubunitLength of 33189 and a sequence of T. In the following code block I'm showing the actual sequence surrounding the positions being deleted. With the before and after bases being shown in this example, I'm fairly confident that the entire 33,189bp span of sequence is repeating. So, I believe we have an error somewhere in this logic.

NC_000016.9:g.89306725_89339913del

--- 1-indexed positional coordinates (hgvs)
   89306725....89339913
       |---------|            
   G T T C ... A T C ...
--- 0-indexed interbased coordinates (vrs)
   G T T C ... A T C ...
    |             |
 89306723.....89339913

RLE.length = 1, sequence = "T"

theferrit32 commented 7 months ago

It looks like maybe it is because the original sequence NC_000016.9:g.89306725_89339913 is bounded on the left and right by a T.

Original sequence length: 33188
Normalized sequence length: 33190
VRS repeatSubunitLength: 33189
original_sequence[:20]='TCGAGACCAGCCTGGCCAAC'
normalized_sequence[:20]='TTCGAGACCAGCCTGGCCAA'
original_sequence[-20:]='AGGTCAAGAGATCGAGACCA'
normalized_sequence[-20:]='GGTCAAGAGATCGAGACCAT'

(from notebook showing T before and after the reference sequence: https://gist.github.com/theferrit32/c9347dd7a5db88b986055bfde7f434c4)

So the result is correct (I think). But maybe we want to clarify or change how we handle large deletions. Would state.sequence="" and state.length=0 also be an accurate reflection of the variant state, and be more clear? And not roll left/right?

ehclark commented 7 months ago

The current behavior looks correct to me. And the normalized form needs to be state.sequence="T" and state.length=1 because both NC_000016.9:g.89306725_89339913del and NC_000016.9:g.89306724_89339912del would normalize to the same location and state.

ahwagner commented 7 months ago

Agreed, no error in the logic. This seems to be working per the algorithm.

If we want to look for counts of specific repeating subunits, this is a separate, easy-to-implement method that can operate on an RLE Allele.

ga4gh / vrs-python

RLE bug with large deletions #363