benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Is there any way to further speed up dada? #1976

Open constructivedio opened 3 months ago

constructivedio commented 3 months ago

Hi and thanks for providing this amazing tool.

I am currently running dada on a set of very deeply sequenced samples: around 3-6M 300 bp NextSeq reads per sample remain after filtering.

What I'm currently doing:

  1. As you previously suggested here, I've sped up the learnErrors step by subsampling my samples (using 10% of the reads), and it worked (see the sketch after this list).

  2. I then parallelise the sample inference step, one sample per job. I am using a 96-CPU machine, so I give each sample 8 CPUs, and I can see the jobs are using ~30-60 GB of memory each. However, this step has been running for about 16 hours and not even the forward reads have finished processing. filtFs and filtRs each point to a single file:

         sam = sample.names[1]
         ddF <- dada(filtFs, err=errF, multithread=TRUE)
         ddR <- dada(filtRs, err=errR, multithread=TRUE)
         merger <- mergePairs(ddF, filtFs, ddR, filtRs, verbose=TRUE)
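For reference, here is a minimal sketch of the subsampling in step 1. It assumes the ShortRead package; the 5e5-read target and the "_sub" output file names are placeholders, not values from this thread:

```r
library(dada2)
library(ShortRead)

set.seed(100)
# Hypothetical names for the subsampled files (assumes .fastq.gz inputs)
subFs <- sub("\\.fastq\\.gz$", "_sub.fastq.gz", filtFs)

for (i in seq_along(filtFs)) {
  sampler <- FastqSampler(filtFs[i], n = 5e5)  # ~10% of a ~5M-read sample (illustrative)
  writeFastq(yield(sampler), subFs[i], compress = TRUE)
  close(sampler)
}

# Learn the error rates from the subsampled files only
errF <- learnErrors(subFs, multithread = TRUE)
```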

The number of unique sequences differs from sample to sample, but the unique_sequences/reads ratio is always around 0.25. For example: Sample 1 - 4908635 reads in 1163375 unique sequences.
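A quick way to check these reads/uniques numbers per filtered file, as a small sketch using dada2's dereplication step (`drp` is just a throwaway name here):

```r
library(dada2)

drp <- derepFastq(filtFs[1], verbose = TRUE)
n_reads   <- sum(drp$uniques)         # total reads in the file
n_uniques <- length(getUniques(drp))  # distinct sequences
n_uniques / n_reads                   # the ~0.25 ratio mentioned above
```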

Adapters have been removed and reads have been trimmed using both fastp and dada2:

    filterAndTrim(forward_files, filtFs, reverse_files, filtRs,
                  trimRight=c(10,10), maxN=0, truncQ=2, maxEE=c(2,6),
                  truncLen=c(230, 230), rm.phix=TRUE, compress=TRUE,
                  verbose=TRUE, multithread=TRUE)

Do you have any suggestions on how to speed this up?

Thank you so much

benjjneb commented 3 months ago

Extremely deep samples are the most difficult for DADA2 performance-wise, so there is no silver-bullet fix. Some things that can help are being more stringent about filtering (this reduces the number of unique sequences in the data by removing the reads most likely to contain errors) and truncating reads to be shorter (this reduces alignment time). In my experience your data is tractable; I have run dada2 on data with 1-2M unique sequences, but it does take a while.
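As an illustration of that kind of change (not a specific recommendation for this dataset), a stricter version of the filterAndTrim call from above might look like the sketch below. The maxEE and truncLen values are placeholders, and truncLen still needs to leave enough overlap for mergePairs on this amplicon:

```r
filterAndTrim(forward_files, filtFs, reverse_files, filtRs,
              trimRight = c(10, 10), maxN = 0, truncQ = 2,
              maxEE = c(1, 2),         # stricter than the original c(2, 6)
              truncLen = c(200, 180),  # shorter reads -> faster alignments (illustrative)
              rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)
```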

constructivedio commented 3 months ago

Thanks so much for your prompt response! I’ll play a bit more with filtering.

Would splitting a sample's reads into subsamples, running dada on each subsample, and then summing the counts across the subsamples help, do you think? And if so, roughly how many reads should I try per subsample?

Once again thanks for taking the time to help!

benjjneb commented 3 months ago

Would splitting a sample's reads into subsamples, running dada on each subsample, and then summing the counts across the subsamples help, do you think?

Yes, this would speed things up, but it isn't recommended, because you reduce your ability to detect rare variants in each split. The better approach is to crank up the quality filtering -- throwing away lower-quality reads to reduce the overall read count (or, more importantly for DADA2 computation time, the number of unique sequences in the data) is a win-win.