genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

Speed improvements, especially for sorry genomes #92

Open joelmartin opened 6 years ago

joelmartin commented 6 years ago

pindel2vcf runs very slowly on plant genomes that aren't in the very best of shape; the current version can take many days to process pindel output. The changes in this pull request let us process results in a reasonable amount of time.

Changes in this pull request:

Use the .fai FASTA index to avoid parsing the entire reference file multiple times. Previously the reference was parsed at least once, plus once per contig in the results. Pindel currently requires the .fai file, so I believe it's reasonable to assume it exists.
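For readers unfamiliar with the index, here is a minimal sketch of the idea, not the code in this pull request. It assumes the samtools-style .fai layout (tab-separated name, length, byte offset, bases per line, bytes per line) and an uncompressed reference; `FaiEntry`, `loadFai` and `fetchSequence` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// One record of a samtools-style .fai index (first five tab-separated columns).
struct FaiEntry {
    std::uint64_t length = 0;     // number of bases in the sequence
    std::uint64_t offset = 0;     // byte offset of the first base in the FASTA file
    std::uint64_t lineBases = 0;  // bases per sequence line
    std::uint64_t lineWidth = 0;  // bytes per sequence line, newline included
};

// Parse the whole index once; it is tiny compared with the reference itself.
std::map<std::string, FaiEntry> loadFai(std::istream& fai) {
    std::map<std::string, FaiEntry> index;
    std::string line;
    while (std::getline(fai, line)) {
        std::istringstream fields(line);
        std::string name;
        FaiEntry e;
        if (std::getline(fields, name, '\t') &&
            (fields >> e.length >> e.offset >> e.lineBases >> e.lineWidth)) {
            index[name] = e;
        }
    }
    return index;
}

// Jump straight to one contig instead of scanning the file from the top.
std::string fetchSequence(std::istream& fasta, const FaiEntry& e) {
    std::string seq;
    seq.reserve(e.length);
    fasta.clear();  // drop any EOF/fail state left by an earlier read
    fasta.seekg(static_cast<std::streamoff>(e.offset));
    std::string line;
    while (seq.size() < e.length && std::getline(fasta, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        seq.append(line, 0, e.length - seq.size());  // never read past the contig
    }
    return seq;
}
```

With this shape, the cost of loading a contig depends on that contig's length rather than on its position in the reference, which is what matters when the results touch thousands of scaffolds.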

Index the first occurrence of each chromosome in each pindel result file (_D, _INT, etc.) during the first-pass scan in GetSampleNamesAndChromosomeNames, then use that index to avoid reparsing the entire pindel output files for every new contig.
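The first-occurrence index can be sketched roughly as follows. This is an illustration under assumptions, not the actual change: it assumes a summary line starts with a digit and carries a `ChrID <name>` field, and `indexFirstOccurrences` is a hypothetical helper name.

```cpp
#include <cctype>
#include <cstddef>
#include <ios>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Single pass over one result file: remember the byte offset of the first
// summary line seen for each chromosome name.
std::map<std::string, std::streampos> indexFirstOccurrences(std::istream& in) {
    std::map<std::string, std::streampos> firstSeen;
    std::string line;
    for (;;) {
        const std::streampos lineStart = in.tellg();  // offset before reading the line
        if (!std::getline(in, line)) break;
        if (line.empty() || !std::isdigit(static_cast<unsigned char>(line[0]))) continue;
        const std::size_t tag = line.find("ChrID");
        if (tag == std::string::npos) continue;
        std::istringstream rest(line.substr(tag + 5));  // text after "ChrID"
        std::string chrom;
        // emplace does nothing if the key exists, so the first offset is kept.
        if (rest >> chrom) firstSeen.emplace(chrom, lineStart);
    }
    return firstSeen;
}
```

A later pass for a given contig can then call `clear()` and `seekg()` with the stored offset and start reading there, instead of rescanning the file from the beginning.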

Limit calls to isSVSummarizingLine by first checking whether the line starts with a digit.
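A rough sketch of the prefilter, for illustration only. `fullSummaryCheck` is a hypothetical stand-in for the costlier test (it is not the real isSVSummarizingLine, whose signature is not shown here), and the `expensiveCalls` counter exists only to make the saving visible.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for an expensive classifier: requires a leading
// integer and a "ChrID" field somewhere on the line.
bool fullSummaryCheck(const std::string& line) {
    std::istringstream fields(line);
    long index = 0;
    return (fields >> index) && line.find("ChrID") != std::string::npos;
}

// Cheap test: one character comparison, no parsing or allocation.
bool startsWithDigit(const std::string& line) {
    return !line.empty() && std::isdigit(static_cast<unsigned char>(line[0])) != 0;
}

// Counts summary lines, running the expensive check only on lines that
// pass the cheap one. expensiveCalls is incremented once per expensive check.
std::size_t countSummaryLines(std::istream& in, std::size_t& expensiveCalls) {
    std::size_t summaries = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!startsWithDigit(line)) continue;  // most lines are rejected here
        ++expensiveCalls;
        if (fullSummaryCheck(line)) ++summaries;
    }
    return summaries;
}
```

Since most lines in a result file are reads and separators rather than summaries, the cheap check rejects the bulk of them before any parsing happens.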

Use std::getline instead of reading character by character. I've tested std::getline with FASTA sequence up to 400mb on a single line, and it has no issues. I'm guessing the version note about getline having issues referred to std::istream::getline, which needs buffer management.
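To illustrate the difference, here is a small sketch of line-based FASTA reading with the free function std::getline, which grows the std::string as needed. The member function std::istream::getline, by contrast, writes into a fixed-size char buffer and sets failbit when a line does not fit. `readFastaRecord` is a hypothetical helper, not a function from pindel.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads the next FASTA record: the header text after '>' goes into name and
// all following sequence lines are concatenated into seq.
// Returns false when no further header is found.
bool readFastaRecord(std::istream& in, std::string& name, std::string& seq) {
    name.clear();
    seq.clear();
    std::string line;
    // Skip ahead to the next header line.
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            name = line.substr(1);
            break;
        }
    }
    if (!in) return false;  // ran out of input before finding a header
    // Consume sequence lines until the next header or end of input.
    // std::getline resizes the string, so line length is not a concern.
    while (in.peek() != '>' && std::getline(in, line)) {
        seq += line;
    }
    return true;
}
```

Reading whole lines this way leaves buffering to the stream and avoids a function call per character, which adds up on references of several hundred megabytes.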

Timing:

kitaake - 12 chromosomes followed by 1300 scaffolds (~400mb)
v 0.6.3: 56 minutes
v 0.6.0: 5 minutes
this PR: 30 seconds

nipponbare - 12 chromosomes and 2 organelles (~400mb)
v 0.6.3: 241 seconds
v 0.6.0: 55 seconds
this PR: 50 seconds

panicum - 9 chromosomes followed by 8400 scaffolds (~550mb), result files pre-grepped for ChrID lines
v 0.6.3: killed after 3 days; estimated at over a month
v 0.6.0: 22 hours 46 minutes
this PR: 41 minutes

clostridium - 1 contig, 3.5mb
v 0.6.3: 2 seconds
v 0.6.0: 1 second
this PR: 1 second