cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0
14 stars 9 forks source link

unpack_mixtures_and_reverse becomes computationally intractable when sequence has many ambiguous characters #637

Open dmacmillan opened 3 years ago

dmacmillan commented 3 years ago

https://github.com/cfe-lab/MiCall/blob/f79fca0d135dd08d22c4324f9e81bc819db813ce/micall/utils/probe_finder.py#L127

I traced a bug to this function where if the input argument seq contains too many ambiguous characters (ones that are resolved to many, i.e. '-' -> 'ACTG') then the function will hang because it attempts to resolve every possible sequence (which can be a very large number).

To correct this I count all mixture characters first and instantiate a count=1. For each mixture I multiply the count by the number of potential resolution characters. I.e. "A-A-" has 2 mixture characters, '-' resolves to A,C,T, or G so it can resolve to 4 potential sequences. Since I see two '-' characters I multiply the count by 4 twice. I then create a threshold (i.e. 1000) such that if the count (which equals the number of potential sequences) is greater than that number, this function returns with nothing.

Let me know if this is a viable solution or if you have a different or better proposal.

donkirkby commented 3 years ago

Sounds like maybe you're using that function for something different than it was designed for. Are you passing in sample data?

As far as I remember, that was intended to handle primer sequences that contained mixtures, so it would look for all the possible combinations. Are you searching for a primer sequence that contains dashes?

dmacmillan commented 3 years ago

Yes, it ties into ProbeFinder which I use to locate primers within sample sequences. One of the culprit sequences I found contained many dash characters.

donkirkby commented 3 years ago

Is that culprit sequence a primer?

dmacmillan commented 3 years ago

Yes and no, I try to align the ends of the samples to the ends of HXB2 which contain primers but are not uniquely primers as they have extra sequence on the ends in case the samples only align a small segment within the primer regions

donkirkby commented 3 years ago

OK, so you want to look at the ends of the sample sequence to see if they match anything in HXB2 relatively close to the expected primer regions? Here are some ideas:

  1. What you've already done: expand dashes to match any base, and give up if there are too many dashes. I assume the dashes come from bases with low coverage near the end of the contig, right?
  2. Leave the dashes in place, and try to align it anyway. I can't remember exactly what the Gotoh algorithm does with dashes.
  3. Get rid of the dashes by using bases even when they have lower than 100 coverage.
  4. Trim more off the end of the contig before you search for it in HXB2. Trim until you get below some threshold of dashes.
  5. Use the genome_coverage.csv file to tell you where the end of the contig aligns in HXB2. In v7.13, that is based on minimap2.