genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

pindel2vcf reference parsing speedup #89

Closed joelmartin closed 6 years ago

joelmartin commented 6 years ago

In version 0.6.2, pindel2vcf switched from getline to reading the FASTA file char by char. From the changelog: "Version 0.6.2 [December 12th, 2014] Now robust against fasta files that have non-standard line lengths (C++'s getline does not work well on lines of over a million characters)"

Only the istream member function getline has that issue, because it reads into a fixed-size buffer. std::getline reads into a std::string that expands as needed, so it handles lines of any length. This patch restores the previous code from the svn repo, but uses std::getline explicitly instead of the implicit call.

Timings:

- For a ~400 MB plant assembly with 1300 contigs, processing time dropped to 582 seconds from 2043.
- For a ~400 MB plant assembly with 14 contigs, processing time dropped to 103 seconds from 154.
- For a 5 MB bacterial assembly with 1 contig, the difference wasn't reliably detectable.