benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

learnErrors() on PacBio 16S very slow #2005

Open trawler-crawler opened 3 weeks ago

trawler-crawler commented 3 weeks ago

Hi,

I'm trying to use dada2 to analyse PacBio full-length 16S amplicon data. I only have four samples, which I filtered with these settings:

> track2 <- filterAndTrim(list.files(input_dir, pattern="\\.fastq\\.gz$", full.names=TRUE),
                         filts2, 
                         minQ=3, minLen=1100, maxLen=1800, 
                         maxN=0, rm.phix=FALSE, maxEE=2, 
                         compress=TRUE, multithread=TRUE)

> print(track2)
           reads.in reads.out
1.fastq.gz   190343     63771
2.fastq.gz   187175     63361
3.fastq.gz   203640     68187
4.fastq.gz   184003     62988

The resulting fastq.gz files are 22 to 23 MB in size.

Dereplication shows:

> drp2 <- derepFastq(filts2, verbose=TRUE)
Dereplicating sequence entries in Fastq file: C:/Users/1.fastq.gz
Encountered 56052 unique sequences from 63771 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/2.fastq.gz
Encountered 54948 unique sequences from 63361 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/3.fastq.gz
Encountered 55390 unique sequences from 68187 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/4.fastq.gz
Encountered 53148 unique sequences from 62988 total sequences read.

I realise I have a lot of unique sequences.

Next, I want to learn the error rates, but it is extremely slow. My output so far:

> err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
186771417 total bases in 127132 reads from 2 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.

learnErrors() has been running for about 24 hours now and is still going. Is this normal? Can it be sped up somehow?

My computer should be sufficient for this: I have 32 GB of RAM and a CPU with 8 cores, 16 logical processors, and a base speed of 4.20 GHz. CPU consumption for the R process is consistently around 30-35%.

Any help with this would be very much appreciated. Thank you.

benjjneb commented 3 weeks ago

learnErrors is running the core DADA2 denoising algorithm repeatedly. Increasing numbers of unique sequences, increasing true diversity in a sample, and increasing sequence length all increase computation time. learnErrors can be the most computationally intensive step in the DADA2 workflow, and long-read data is more challenging computationally, so what you're seeing is not out of the realm of expectation. That said, given your number of unique sequences, I would expect this to be computationally tractable in ~days on a laptop.
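
As a rough, illustrative way to see where the time is going (a sketch only, reusing the drp2 object from your post), you can tabulate the number of unique sequences and their lengths per sample:

```r
# Rough workload check (illustrative): denoising time grows with the number of
# unique sequences and with read length, so look at both for each sample.
sapply(drp2, function(d) length(d$uniques))                  # unique sequences per sample
sapply(drp2, function(d) median(nchar(names(d$uniques))))    # median unique-sequence length
```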

A couple of things you can do to speed things up a bit:

  1. Filter more stringently.
  2. Remove the BAND_SIZE=32 parameter (the default is 16 and will be faster).
  3. Move to a higher-performance compute node.
  4. Check whether there is any issue with memory pressure. If so, remove the derepFastq step and run learnErrors on the filtered fastq files instead (they will then be loaded into memory one at a time, on the fly); see the sketch below this list.
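
Something like the following sketch combines (2) and (4), using the filts2 file paths you already have. Stricter filtering (suggestion 1) would mean, e.g., a lower maxEE in filterAndTrim; the exact value is up to you.

```r
# Sketch of suggestions (2) and (4): pass the filtered fastq.gz paths straight to
# learnErrors (files are then loaded one at a time) and drop BAND_SIZE=32 so the
# faster default band size of 16 is used.
library(dada2)
err2 <- learnErrors(filts2, errorEstimationFunction=PacBioErrfun, multithread=TRUE)
```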

> CPU consumption for the R process is consistently around 30-35%

This seems low to me. Is this 30-35% of all available processors? Or 30-35% of a single processor?
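
If it turns out only one core is actually being used, a quick check is to see how many cores R can detect and then set the thread count explicitly (a sketch; the specific thread count below is just an example value):

```r
# See how many logical cores R detects, then pass an explicit integer thread
# count to learnErrors instead of relying on multithread=TRUE.
parallel::detectCores()    # should report 16 on the machine described above
err2 <- learnErrors(filts2, errorEstimationFunction=PacBioErrfun, multithread=8)
```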