caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit
http://caleblareau.github.io/mgatk
MIT License

Speed up bam splitting for bcall #58

Closed AntonJMLarsson closed 2 years ago

AntonJMLarsson commented 2 years ago

I made two changes to split_barcoded_bam.py which together speed it up roughly 10-fold:

Each change turns its task from a linear-time into a constant-time operation, which speeds up the script overall.
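For readers less familiar with the distinction, here is a generic sketch of a linear-time versus a constant-time lookup. It is not the diff from this PR: the barcode names and both helper functions are hypothetical, chosen only to show why swapping a list scan for a hash-based lookup can matter when it runs once per read.

```python
# Hypothetical illustration only; this is not the code changed in the PR.
# Made-up barcode whitelist used purely for the example.
barcodes = [f"BC{i:05d}" for i in range(10_000)]

barcode_list = barcodes       # `in` scans the list element by element: O(n)
barcode_set = set(barcodes)   # `in` hashes the key: O(1) on average


def is_known_linear(bc):
    """Membership test against a list (linear in the number of barcodes)."""
    return bc in barcode_list


def is_known_constant(bc):
    """Membership test against a set (constant time on average)."""
    return bc in barcode_set
```

Both functions return the same answers; only the cost per call differs, and that difference is multiplied by the number of reads in the BAM file.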

Best, Anton

caleblareau commented 2 years ago

This looks great! Thank you for the PR!! Can I ask about the 5 threads in the alignment file import? Should I parameterize that, or is it a magic number?

AntonJMLarsson commented 2 years ago

I've found that extra threads usually speed up the reading a little. Beyond 5 threads the returns diminish, so using more than that is probably not helpful. But as you can see, I removed it in the next commit, since the thread count should depend on the -c option, and that is not currently an argument to this particular script. If it were implemented as an argument, I'd do something like threads = min(5, cores) to ensure that no more cores are used than were specified.
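A minimal sketch of that idea, assuming the script gained a core-count argument. The `-c`/`--ncores` flag on this script, the `reader_threads` helper, and the lower bound of 1 are assumptions added for illustration; none of them exist in split_barcoded_bam.py as discussed here.

```python
import argparse


def reader_threads(cores, cap=5):
    """Cap the BAM-reading threads at `cap`, never exceeding `cores`.

    The floor of 1 is an added safeguard (an assumption, not from the PR)
    so that a zero or negative core count still yields a usable value.
    """
    return max(1, min(cap, cores))


def parse_args(argv=None):
    # Hypothetical argument; the script does not currently accept -c.
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--ncores", type=int, default=1)
    return parser.parse_args(argv)


args = parse_args(["-c", "16"])
threads = reader_threads(args.ncores)
# `threads` could then be handed to the reader, for example via the
# `threads` argument of pysam.AlignmentFile (assuming pysam is the reader).
```

With this shape the cap of 5 stays a single default in one place, and a user who requests fewer cores is never given more threads than they asked for.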

Regardless, most of the speed-up comes from the two changes detailed in the first comment.