Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions)

larrybabb commented 7 months ago

When trying to normalize the variant NC_000015.9:g.7211_7214del the routine will go into a seemingly endless routine to try to figure out the normalized result for the Allele.state.

Without a full analysis their is evidence that this is likely caused by the fact that the first 17 million bases in chromosome 15 are all Ns. So as it rolls right/left to get to a unique sequence region it will go on for an impractical amount of time.

I suggest we put a limit in terms of how large the sequence can grow up to when normalizing the Allele. But we should discuss how to best handle this.

@toneillbroad just suggested that maybe we simply disallow any normalization that includes ambiguity coded bases not A, C, T or G. I sort of like that as a general rule of thumb, since it is very difficult to address the true normality of a sequence that includes any of the ambiguity codes. We can make this a vrs-python rule so that our normalizer doesn't go off and never return in these portions of the reference sequences

larrybabb commented 7 months ago

@ahwagner we would like you to weigh in on this so we can put a stop gap solution into vrs-python ASAP. even if we have to revisit a more formal decision later.

ahwagner commented 7 months ago

I agree with the solution proposed by @toneillbroad.

theferrit32 commented 6 months ago

It sounds like based on discussion with @larrybabb that this is only a problem for genomic sequences, not transcripts.

(so by N below I really mean anything not A C T G)

cases:

substitution: check if the ref or alt includes an N
insertion: check if the sequence being inserted includes an N
deletion: check if the sequence being deleted includes an N
dup: check if the sequence being duplicated includes an N

others?

theferrit32 commented 6 months ago

With great frustration with multithreading in Python, I have found a way to work around this issue in client code at a higher level that doesn't add that much overhead. Using a background task queue, a return value queue, a background process which runs the tasks and can be interrupted, and a timeout on return values, I can terminate any call into a Translator that takes, say, longer than 1 minute, and add an error message to the output file that indicates that variant was skipped.

It may still be nice to implement something in vrs-python which checks the sequence beforehand, or in bioutils during roll left/right, because this would make this available to other codebases which use vrs-python. Or I could look at adding something like a translator wrapper which has the timeout logic built in.

ahwagner commented 6 months ago

@theferrit32 have we proposed implementing this over in Biocommons? I agree that it makes sense for us to implement the solution there.

ga4gh / vrs-python

Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions) #397