brentp / bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
https://arxiv.org/abs/1401.1129
MIT License
139 stars 53 forks source link

Faster 'convert_fasta' and 'as_bam' #11

Closed superbobry closed 6 years ago

superbobry commented 10 years ago

These two seem to be the bottlenecks during alignment. Have you considered using Cython or C/C++ to speed things up?

brentp commented 10 years ago

convert_fasta happens only once at index creation. For as_bam, I think you'd want to benchmark and see where in handle_reads() is the bottleneck. It's either in that function and/or the methods on class Bam

superbobry commented 10 years ago

Right, I meant convert_reads. I'll do the benchmarks and update the issue.

brentp commented 10 years ago

I'd prefer to make the cython dep optional as people seem to have enough trouble with vanilla python modules. There might be some low-hanging fruit that would improve the performance.

Have you noticed that bwa mem is generating alignments faster than the python in bwa-meth can consume them?

superbobry commented 9 years ago

I've recently used bwa-mem for aligning unpaired reads and noticed a slowdown on the Python a side. I'm thinking that treating paired and unpaired reads separately in convert_reads might help, because all of the operations below are useless for the unpaired case:

# vvv --- we can drop the loop as well.
for read_i, (name, seq, _, qual) in enumerate(pair):
    if name is None: continue
    name = name.rstrip("\r\n").split(" ")[0]  # except for this line.
    if name.endswith(("_R1", "_R2")):
         name = name[:-3]
    elif name.endswith(("/1", "/2")):
         name = name[:-2]

What do you think?

brentp commented 9 years ago

If you benchmark and see a significant improvement from that change, I'd accept. Maybe this:

for pair in izip(q1_iter, q2_iter)

could be

qiter = izip(q1_iter, q2_iter) if fq2 != "NA" else q1_iter

then loop over that and it could remove some of the internal checks.

I'd guess that the slowdown might be in processing the bam rather than converting the FASTQ