dotnetbio / bio

Bioinformatics library for .NET
Apache License 2.0
143 stars 50 forks source link

Speed bottleneck: Avoid re-parsing bam header when reading many intervals #40

Open RoboStephen opened 4 years ago

RoboStephen commented 4 years ago

I've got come code which parses many intervals from a bam file, with calls to the ParseRange() method. The results are correct, but slower than expected.

Reading through the source code (and running the profiler), it looks like the slowdown came from time spent re-parsing the bam header and index in each each call to ParseRange(). Most of my time was spent in BamParser.GetHeader() or BamIndexStorage.Read(). For the particular use case I'm working on, the overhead of re-reading and re-parsing is high. (My bam file has >10000 contigs. I'm parsing several thousand intervals. There are ~100 aligned reads for a typical interval, though some have far higher coverage).

I'd like to speed this up, preferably in a non-hacky way. I've got a change to my copy of the code that does the trick for my use case. The parser caches its I added an optional "useCaching" flag to the ParseRange() method (false by default). If the flag is set to true, the parser re-use the existing SAMAlignmentHeader and BAMIndex objects (if they exist - if they're null we parse as usual). With the current patch it's up to the caller to avoid setting this flag to true when switching between bam files, though that could be an easy sanity-check to add. (And of course, this caching would not be appropriate when parsing a bam file that has been updated on disk in between one ParseRange() call and the next). I can polish that up for a pull request if that sounds like a sane way to proceed and not too much of a corner case to worry about.

jamesmhogan commented 4 years ago

Thanks for the post. Sounds a sane approach and if I understand the proposal simpler cases would be unaffected. Happy to take a look.

On Fri., 18 Oct. 2019, 11:25 am RoboStephen, notifications@github.com wrote:

I've got come code which parses many intervals from a bam file, with calls to the ParseRange() method. The results are correct, but slower than expected.

Reading through the source code (and running the profiler), it looks like the slowdown came from time spent re-parsing the bam header and index in each each call to ParseRange(). Most of my time was spent in BamParser.GetHeader() or BamIndexStorage.Read(). For the particular use case I'm working on, the overhead of re-reading and re-parsing is high. (My bam file has >10000 contigs. I'm parsing several thousand intervals. There are ~100 aligned reads for a typical interval, though some have far higher coverage).

I'd like to speed this up, preferably in a non-hacky way. I've got a change to my copy of the code that does the trick for my use case. The parser caches its I added an optional "useCaching" flag to the ParseRange() method (false by default). If the flag is set to true, the parser re-use the existing SAMAlignmentHeader and BAMIndex objects (if they exist - if they're null we parse as usual). With the current patch it's up to the caller to avoid setting this flag to true when switching between bam files, though that could be an easy sanity-check to add. (And of course, this caching would not be appropriate when parsing a bam file that has been updated on disk in between one ParseRange() call and the next). I can polish that up for a pull request if that sounds like a sane way to proceed and not too much of a corner case to worry about.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dotnetbio/bio/issues/40?email_source=notifications&email_token=AB2SCGUH7AV262COYDWVNI3QPEGCHA5CNFSM4JCBMJKKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HSTWL7Q, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB2SCGUBAADXEAJ72WGODL3QPEGCHANCNFSM4JCBMJKA .

RoboStephen commented 4 years ago

Sounds good! What's the best way to share the code changes? I tried submitting a pull request, but I don't think I have the permissions to push my local branch (cacheBamHeader) to github. For now, let me upload the diffs for the two affected files - that's better than nothing!

Diff.BAMParser.txt Diff.BAMParserExtensions.txt

RoboStephen commented 4 years ago

Also....it's a nitpick, but there's one L too many in the file name DotNetBio-Fulll.sln (should be Full instead of Fulll). As long as we're contemplating changes we could fix that too.

RoboStephen commented 4 years ago

Update: Here are diffs for a new-and-improved fix (all tests pass, and a fix for issue #41 is incorporated as well) MyDiffs.txt