brentp / bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
https://arxiv.org/abs/1401.1129
MIT License
144 stars 54 forks source link

Support for BWA-MEM2 #71

Closed PoisonAlien closed 3 years ago

PoisonAlien commented 3 years ago

Hi,

This PR adds support for aligning with BWA-MEM2 aligner which offers significant speed with identical alignment to BWA-MEM

Usage:

Only change is indexing step which requires mem2 tag, in the absence of which bwa-meth uses mem.

bwameth.py index $REF #Indexes with BWA-MEM (default)
    #OR
bwameth.py index mem2 $REF #Indexes with BWA-MEM2

Alignment step does not change and, the aligner MEM or MEM2 is automatically chosen based on the index type

bwameth.py --reference $REF some_R1.fastq.gz some_R2.fastq.gz > some.output.sam

Benchmark:

I have tested it on 250K, 150bp paired-end reads. Runtime bench-marked with hyperfine


$ cat gen_mem1_sh.sh 
#!/usr/bin/env bash
python ../bwameth.py --reference ../bwameth_idx/bwamem1/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz > mem1.sam

$ cat gen_mem2_sh.sh 
#!/usr/bin/env bash
python ../bwameth.py --reference ../bwameth_idx/bwamem2/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz > mem2.sam

$ hyperfine --warmup 2 './gen_mem1_sh.sh' './gen_mem2_sh.sh'
Benchmark #1: ./gen_mem1_sh.sh
  Time (mean ± σ):     173.818 s ±  1.141 s    [User: 980.671 s, System: 8.116 s]
  Range (min … max):   171.717 s … 175.572 s    10 runs

Benchmark #2: ./gen_mem2_sh.sh
  Time (mean ± σ):     104.469 s ±  0.473 s    [User: 477.614 s, System: 21.802 s]
  Range (min … max):   103.605 s … 104.972 s    10 runs

Summary
  './gen_mem2_sh.sh' ran
    1.66 ± 0.01 times faster than './gen_mem1_sh.sh'

Output

Both mem and mem2 produce near identical alignments. Only difference is in the header @PG line


$ python ../bwameth.py --reference ../bwameth_idx/bwamem1/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz > mem1.sam

$ python ../bwameth.py --reference ../bwameth_idx/bwamem2/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz > mem2.sam

$ diff mem1.sam mem2.sam 
26c26
< @PG   ID:bwa-meth PN:bwa-meth VN:0.2.2    CL:"../bwameth.py --reference ../bwameth_idx/bwamem1/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz"
---
> @PG   ID:bwa-meth PN:bwa-meth VN:0.2.2    CL:"../bwameth.py --reference ../bwameth_idx/bwamem2/chm13.draft_v1.1.fasta ../fastq/R1.fq.gz ../fastq/R2.fq.gz"

Summary

Aligning with MEM2 is ~66% faster than the MEM1 with identical alignment.

My python isn't super good but these changes don't seem to break any existing changes and woks fine. Hope this will help many others like me who is or planning to migrate to MEM2 for speed.

Thank for all your awesome work :)

brentp commented 3 years ago

This looks great! What do you think about having

bwameth.py index
# and
bwameth.py index-mem2

?

so we can enumerate the options a bit more easily (not needed for this PR). This might help in future if we support bwa-meme: https://www.biorxiv.org/content/10.1101/2021.09.01.457579v1

I could go either with how you have it now or this way, your call.

PoisonAlien commented 3 years ago

Sure, index-mem2 is more intuitive. I have pushed a patch - feel free to update/change if you would like. BWA-MEME seems like it improves upon MEM2 speed. Worth trying for sure, thanks.

brentp commented 3 years ago

thanks for the contribution. this is very cool!