benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

CPU usage for denoise step with multithread=TRUE and pool=TRUE #334

Closed elsherbini closed 7 years ago

elsherbini commented 7 years ago

I'm running dada2 on a node with 20 cores. I pooled 5 samples together to get 750k reads and ran: denoised <- dada(derepF, err=errF, multithread=20, pool=TRUE)

I'm noticing that with either multithread=TRUE or multithread=20, it starts out using 30-50% of the available CPU (top reports the R process at ~800% CPU). Then, after a few minutes, it drops to only 100% CPU (5% of the 20 cores).

Are there parts of the denoising step that are still limited to 1 core when pool=TRUE? And for the part of the process that was running on multiple cores, any idea why it wasn't using all the available resources?

benjjneb commented 7 years ago

The parts outside the main denoising algorithm are still single-threaded, for example constructing the input and output within R. Those single-threaded parts should matter less and less as the input data gets larger (i.e. they should make up a smaller fraction of the total time as the number of reads increases), but they could still be a non-trivial part of processing 750k reads with 20 cores.

How long (roughly) was the program running in single-threaded mode? There are also other issues that could crop up due to insufficient memory -- how much memory was available on this node?

elsherbini commented 7 years ago

I think the single-threaded part was only about 30 minutes out of roughly 90 total.

I'm now running the full library (~10 million reads) in pooled mode with 20 cores and 128 GB of memory, using multithread=TRUE. top reports the R process using 4.5% of memory, and CPU usage swings between 10% (200%) and 50% (1000%), but usually sits between 30% (600%) and 40% (800%). Does that seem about right? Is there a way to get it to use more CPU on average and finish faster?

elsherbini commented 7 years ago

For reference, it finished in about 20-24 hours, running the forward and reverse reads on separate 20-core machines.

benjjneb commented 7 years ago

> For reference, it finished in about 20-24 hours, running the forward and reverse reads on separate 20-core machines.

That's in the ballpark of what we'd expect. pool=TRUE mode starts to become intractable as you get into the tens of millions of reads.

> top reports the R process using 4.5% of memory, and CPU usage swings between 10% (200%) and 50% (1000%), but usually sits between 30% (600%) and 40% (800%). Does that seem about right? Is there a way to get it to use more CPU on average and finish faster?

You are seeing the practical limitations of our multi-threading implementation: not every part of the algorithm is multi-threaded, and the non-multithreaded parts become a bigger share of the wall-time (and drag down average processor usage) as the core count increases.
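As a back-of-the-envelope illustration of that effect (this is a generic Amdahl's-law calculation, not dada2 code, and the 30% serial fraction is just an assumed example):

```r
# If a fraction `serial` of the work is single-threaded, average CPU
# utilization across `cores` drops as cores increase, even though
# wall-time still improves.
amdahl <- function(serial, cores) {
  wall <- serial + (1 - serial) / cores   # relative wall-time vs. 1 core
  speedup <- 1 / wall
  utilization <- speedup / cores          # fraction of total CPU capacity used
  data.frame(cores = cores,
             speedup = round(speedup, 1),
             avg_utilization = paste0(round(100 * utilization), "%"))
}

amdahl(serial = 0.3, cores = c(1, 4, 20))
# With a 30% serial fraction, 20 cores give only ~3x speedup and ~15% average utilization.
```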

We have some ideas for performance improvements, in multi-threading and elsewhere, with the goal of allowing pool=TRUE to scale up to more like 100M reads. However, that would happen for the 1.8 release (March 2018), not the upcoming 1.6 release (next month).

If you're interested in that, I'd recommend keeping an eye on the repository around the start of next year, but no huge performance gains are imminent.

adityabandla commented 6 years ago

Hi Ben

> We have some ideas for performance improvements, in multi-threading and elsewhere, with the goal of allowing pool=TRUE to scale up to more like 100M reads. However, that would happen for the 1.8 release (March 2018), not the upcoming 1.6 release (next month).

Was this ever implemented? I am running into memory issues while processing a dataset of approx. 9M reads in pool=TRUE mode.

benjjneb commented 6 years ago

@adityabandla Haven't gotten around to it unfortunately.

adityabandla commented 6 years ago

Thanks Ben for the heads up. Any plans to implement this in the near future? I am really stuck using pool=TRUE mode on a single MiSeq run with approx. 7M unique sequences in the reverse reads. The forward reads, with approx. 5M unique sequences, took about 48 h to complete.

With all my datasets, I see that pool=TRUE always gives higher merge rates (as expected, I guess): roughly 66% for per-sample inference, 80% for pseudo-pooling, and 92% for full pooling.

So it would be great if the full pooling mode could be made faster. I benchmarked the time taken for each of these modes: full pooling takes about 7x longer than pseudo-pooling and 42x longer than per-sample inference.
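For context, a minimal sketch of the three inference modes being compared here; `derepFs` (a list of dereplicated samples) and `errF` (a fitted error model) are placeholder names, and pool = "pseudo" assumes dada2 >= 1.8:

```r
library(dada2)

# Per-sample inference, pseudo-pooling, and full pooling differ only in `pool`.
dd_sample <- dada(derepFs, err = errF, pool = FALSE,    multithread = TRUE)  # per-sample inference
dd_pseudo <- dada(derepFs, err = errF, pool = "pseudo", multithread = TRUE)  # pseudo-pooling
dd_pooled <- dada(derepFs, err = errF, pool = TRUE,     multithread = TRUE)  # full pooling
```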

benjjneb commented 6 years ago

@adityabandla I'd love to do it, and know what needs to be done, but it's hard to find the time! I'll let you know, I might be able to get something in for the 1.12 release (March).

For now, is it possible to cut down the number of unique sequences in the reverse reads? That is a pretty high number for a MiSeq run, which is why you're running into trouble. Can the reverse reads be truncated a bit shorter and still leave enough overlap?
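A hedged sketch of one way to act on that suggestion (file paths and truncation lengths below are placeholders, not values from this thread): truncate the reverse reads a bit shorter with filterAndTrim, keeping enough combined length for the reads to overlap, then check how much the unique-sequence count dropped with derepFastq.

```r
library(dada2)

# Placeholder paths and truncation lengths; choose truncLen so that
# forward + reverse still span the amplicon with sufficient overlap.
filterAndTrim(fwd = "R1.fastq.gz", filt = "filtered/R1_filt.fastq.gz",
              rev = "R2.fastq.gz", filt.rev = "filtered/R2_filt.fastq.gz",
              truncLen = c(240, 160),  # reverse reads truncated shorter than before
              maxEE = c(2, 2), multithread = TRUE)

# Count unique sequences remaining in the truncated reverse reads
derepR <- derepFastq("filtered/R2_filt.fastq.gz")
length(derepR$uniques)
```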