GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

threading #36

Closed. ttriche closed this issue 6 years ago.

ttriche commented 6 years ago

Silly question: would it be feasible to thread samblaster? (i.e., would it be worth my taking a whack at it)

Reason I ask is that it's choking on biscuit (https://github.com/zwdzwd/biscuit) output when I run biscuit with loads of threads (40-56) and perhaps it's possible to overcome this one way or another. I hate the idea of sorting twice, and I hate the reality even worse. But with enough reads we choke samblaster.

I'm willing to help with threading, but I was wondering if the authors can immediately point out a choke point that makes this a dumb and pointless idea. This is not meant as a feature request per se!

GregoryFaust commented 6 years ago

Since each read-id set must be compared against all previous read-id sets to see if they are duplicates, it is not clear how much threading would help; that is why we did not use threads in the first place. Can you be more explicit about what you mean by "choking"? samblaster is routinely run attached to the output of bwa mem running at 48-56 threads without the least slowdown.
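To illustrate the point about global state (this is only a toy sketch, not samblaster's actual algorithm or data layout): duplicate marking keeps one shared table of every signature seen so far, so every incoming read must consult the same table, which is what makes naive threading awkward. A minimal awk version, with made-up input columns:

```shell
# Toy sketch: mark a read as a duplicate when its (chrom, pos, strand)
# signature has already been seen. The single associative array "seen"
# is the shared state every read must check - the threading bottleneck.
# Input columns (invented for this example): read_id chrom pos strand
printf '%s\n' \
  'r1 chr1 100 +' \
  'r2 chr1 100 +' \
  'r3 chr1 200 -' |
awk '{
  sig = $2 SUBSEP $3 SUBSEP $4            # build the signature key
  print $1, ((sig in seen) ? "DUP" : "UNIQUE")
  seen[sig] = 1                           # record it for later reads
}'
```

Running this prints r1 UNIQUE, r2 DUP, r3 UNIQUE: r2 collides with r1's signature.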

ttriche commented 6 years ago

We've been running biscuit (https://github.com/zwdzwd/biscuit), which is essentially bwa-mem for bisulfite sequence data, through samblaster, and while it does not have any problem with eRRBS data, it gets in trouble with full WGBS runs (typically biscuit takes 2-3 days maximum to align a 30x-40x WGBS run; with samblaster pipelined between its output and samtools view, we've seen 10-day runtimes not complete).

It's possible that this is a biscuit issue, but it's troubling that as the input scales up, the slowdown seems to get worse. Needless to say, I think samblaster is terrific and would very much like to use it, not only for dupe marking but also for split, discordant, and unmapped/clipped read extraction. I'm having trouble figuring out why it would slow biscuit down so much (especially since biscuit's output is so similar to bwa-mem's output). Any particularly obvious places to start looking for trouble?

Thanks for a great tool.

GregoryFaust commented 6 years ago

First of all, a point I should have made in the first post: samblaster tends to be I/O bound, not CPU bound, which is another reason threading does not seem a good direction to head.

As to what could be slowing down your WGBS runs, have you looked at what is going on in top? For example, is the pipeline of tools including biscuit and samblaster running out of memory? samblaster uses memory proportional to the number of read-ids in the input, so if this grows large, and especially if it then causes virtual memory to start thrashing, runtimes will go up precipitously. However, it is hard to see how this would vary much from one 30-40x run to another unless there is other activity on the server at the same time that is also highly memory-consumptive.

My advice is to run this pipe alone on the server and to consider tuning the amount of memory you give the sort command (say, in sambamba) so that biscuit, samblaster, and the sort combined do not use more than the physical memory on the server. Once swapping begins, runtimes go up dramatically. Many labs run their servers with swapping turned off and just tune pipelines to make things fit.
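The memory-tuning advice above might look something like the pipeline below. This is a hedged sketch, not a vetted command line: the file names are placeholders, and the biscuit invocation is assumed to mirror bwa mem; check each tool's own documentation for the exact flags (samtools sort's -m sets per-thread sort memory, and samblaster's -d/-s redirect discordant and split reads per its README).

```shell
# Sketch only: cap each stage's memory so the sum of biscuit + samblaster
# + sort stays under physical RAM. File names and thread counts are
# placeholders; verify flags against your installed tool versions.
biscuit align -t 48 ref.fa reads_1.fq.gz reads_2.fq.gz \
  | samblaster -d sample.disc.sam -s sample.split.sam \
  | samtools sort -@ 8 -m 1G -o sample.sorted.bam -
# With -@ 8 -m 1G, the sort stage alone can use roughly 8 GB of buffer;
# size this so swapping never starts, per the advice above.
```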

ttriche commented 6 years ago

True -- I suppose I hadn't considered that samblaster and samtools could be contending for resources -- which is rather silly on my part. Thanks for pointing this out.

--t


ttriche commented 6 years ago

N.B. This turned out to be an issue where biscuit was providing samblaster with overly wide intervals for read-pair matching. Once that was fixed, samblaster went back to being fast as hell. samblaster itself maxed out below 2 GB of RAM usage; biscuit and samtools, by contrast, can eat up plenty.

Thanks again for a remarkably fast and effective tool.

--t
