Open larrybabb opened 7 months ago
@ahwagner we would like you to weigh in on this so we can put a stop gap solution into vrs-python ASAP. even if we have to revisit a more formal decision later.
I agree with the solution proposed by @toneillbroad.
It sounds like based on discussion with @larrybabb that this is only a problem for genomic sequences, not transcripts.
(so by N
below I really mean anything not A
C
T
G
)
cases:
others?
With great frustration with multithreading in Python, I have found a way to work around this issue in client code at a higher level that doesn't add that much overhead. Using a background task queue, a return value queue, a background process which runs the tasks and can be interrupted, and a timeout on return values, I can terminate any call into a Translator
that takes, say, longer than 1 minute, and add an error message to the output file that indicates that variant was skipped.
It may still be nice to implement something in vrs-python which checks the sequence beforehand, or in bioutils during roll left/right, because this would make this available to other codebases which use vrs-python. Or I could look at adding something like a translator wrapper which has the timeout logic built in.
@theferrit32 have we proposed implementing this over in Biocommons? I agree that it makes sense for us to implement the solution there.
When trying to normalize the variant
NC_000015.9:g.7211_7214del
the routine will go into a seemingly endless routine to try to figure out thenormalized
result for theAllele.state
.Without a full analysis their is evidence that this is likely caused by the fact that the first 17 million bases in chromosome 15 are all
N
s. So as it rolls right/left to get to a unique sequence region it will go on for an impractical amount of time.I suggest we put a limit in terms of how large the sequence can grow up to when normalizing the Allele. But we should discuss how to best handle this.
@toneillbroad just suggested that maybe we simply disallow any normalization that includes ambiguity coded bases not
A, C, T or G
. I sort of like that as a general rule of thumb, since it is very difficult to address the truenormality
of a sequence that includes any of the ambiguity codes. We can make this a vrs-python rule so that our normalizer doesn't go off and never return in these portions of the reference sequences