cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Consider replacing Gotoh alignment algorithm #556

Open donkirkby opened 4 years ago

donkirkby commented 4 years ago

As I've been working on #549 to add support for SARS-CoV-2 references, I've had some trouble with running out of memory. I think it's partly that I'm running on equipment with less memory than I usually use, and partly that the SARS-CoV-2 genome is longer than HIV or HCV. The specific step that I've had most trouble with is aligning two consensus sequences using our Gotoh algorithm, so maybe it's time to look at alternatives.

@jeff-k had suggested we move from Gotoh to BWA, and that project seems to have been superseded by minimap2. Experiment with these tools for aligning the SARS-CoV-2 consensus sequences, and then decide whether they are worth switching to.

Tasks

donkirkby commented 4 years ago

First impressions of minimap2:

jeff-k commented 4 years ago

For local alignment of two long consensus sequences, assuming one of them spans a range of the other (amplicon vs. reference), Smith-Waterman or Gotoh are sound choices. I have never used minimap2, but maybe the PacBio or Nanopore features for handling long reads would approximate this use case. I don't know what else those features would do, though.

At 30k bp, SARS-CoV-2 is going to stress a SW implementation that builds an entire N x M backtracking matrix. A useful optimization for SW space complexity is banding.

The rust-bio library has a good API for banded alignment, which would be a good test bed for working out the alignment parameters: https://docs.rs/bio/0.20.3/bio/alignment/pairwise/banded/index.html
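
To make the banding idea concrete, here is a minimal Python sketch. It is not MiCall's Gotoh code and not the rust-bio API. The function names, the scoring values, and the linear gap penalty are all assumptions chosen for illustration (Gotoh uses affine gaps), and it returns only the best local score without a traceback. The point is the memory footprint: a full matrix for two 30,000-base sequences has 900,000,000 cells, while a band of ±500 around the diagonal visits about 30,000,000.

```python
def band_cells(n, m, band=None):
    """Number of DP cells visited: full n x m matrix, or a band of +/- `band`."""
    if band is None:
        return n * m
    return n * min(m, 2 * band + 1)


def banded_local_score(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman-style local score, visiting only cells with |i - j| <= band.

    Illustrative assumptions: linear gap penalty, score only (no traceback).
    Only two rows of width 2*band + 1 are kept in memory at any time.
    """
    width = 2 * band + 1
    # Index k in the row for position i corresponds to column j = i - band + k.
    prev = [0] * width
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * width
        for k in range(width):
            j = i - band + k
            if j < 1 or j > len(b):
                continue  # outside the matrix; leave the cell at 0
            # Cell (i-1, j-1) sits at the same index k in the previous row.
            diag = prev[k] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Cell (i-1, j) sits at index k + 1 in the previous row, if inside the band.
            up = prev[k + 1] + gap if k + 1 < width else 0
            # Cell (i, j-1) sits at index k - 1 in the current row, if inside the band.
            left = curr[k - 1] + gap if k > 0 else 0
            curr[k] = max(0, diag, up, left)
            best = max(best, curr[k])
        prev = curr
    return best


if __name__ == "__main__":
    print(band_cells(30_000, 30_000))            # full matrix
    print(band_cells(30_000, 30_000, band=500))  # banded
    print(banded_local_score("GGGGACGT", "ACGT", band=4))
```

The band has to be wide enough to cover the offset between the two sequences. In the usage example the match starts four bases into the first sequence, so a band of 4 finds it and a band of 0 does not. For an amplicon that sits far from the start of the reference, the band would need to be centred on an estimated offset rather than on the main diagonal.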