benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/

Processing time for dada() with PacBio reads #1877

Closed: HeiserS closed this issue 4 months ago

HeiserS commented 8 months ago

Hello, I have been very much enjoying working with dada2 and have successfully used it for a couple of smaller projects. We just got PacBio long-read sequences back, which I have been trying to process following one of the tutorials: https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html However, I cannot seem to get through the dada() step, and I am reaching the end of my knowledge/skills.

I did the following:

remove primers and orient reads

nops1 <- file.path(path1, "noprimers", basename(fns1))  # make new folder for output files
prim1 <- removePrimers(fns1, nops1, primer.fwd=FWD_UNonMet_SH, primer.rev=rc(REV_3143R), orient=TRUE)  # allows for 2 mismatches

I checked afterwards, and for most (although not all) reads the primers were found and removed.
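A minimal sketch of how this can be quantified from prim1 (columns 1 and 2 of the removePrimers() output are the input and output read counts, as used in the summary table further down; this check was not part of the original run):

# Fraction of raw CCS reads retained after primer removal, per sample.
round(prim1[, 2] / prim1[, 1], 3)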

inspect length distribution

lens.fn <- lapply(nops1, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)
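A quick numeric complement to the histogram (a small sketch using the lens vector just computed; the 2000-6000 bp window is an assumption taken from the filtering step further down):

# Overall length summary and fraction of reads inside the 2000-6000 bp window.
summary(lens)
mean(lens >= 2000 & lens <= 6000)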

[Screenshot 2024-01-19: histogram of read lengths from hist(lens, 100)]

We are sequencing the SSU, ITS, and most of the LSU, so I was hoping to build the ASV table first and afterwards cut out the individual regions for taxonomy assignment. The databases for our target organisms are not great, and having several regions per ASV would help with assignment. In addition, I want to be able to pull out the full sequences after taxonomy assignment to look at intraspecific variability.

filter

filts1 <- file.path(path1, "noprimers", "filtered", basename(fns1))
track1 <- filterAndTrim(nops1, filts1, minQ=2, minLen=2000, maxLen=6000, maxN=0, rm.phix=FALSE, maxEE=3)  # retained about 84% - relaxed maxEE to let a few more through
track1
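To double-check the roughly 84% retention noted in the comment, the per-sample fraction can be computed directly from the filterAndTrim() output (a minimal sketch, not part of the original run):

# Per-sample fraction of primer-trimmed reads passing the filter
# (filterAndTrim() returns a matrix of input/output read counts).
round(track1[, 2] / track1[, 1], 2)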

learn error rates

set.seed(100)
err1 <- learnErrors(filts1, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
saveRDS(err1, file.path(path1, "Trial_err1.rds"))
plotErrors(err1)

[Screenshot 2024-01-19: plotErrors(err1) output]

My question here is: with a mean of 30 passes, do we still actually need the error-learning and/or denoising step?

derep <- derepFastq(filts1, verbose=TRUE)

I know the derep step is no longer strictly necessary with the updated dada() function, since dereplication is done on the fly; I kept it as a separate step because I was hoping it would help speed up the next one. I also tried the per-sample loop from the Big Data workflow, but it did not help either.

start.time <- Sys.time()
dd1 <- dada(derep, err=err1, BAND_SIZE=32, multithread=TRUE, OMEGA_A=1e-10, DETECT_SINGLETONS=TRUE)
end.time <- Sys.time()
time.taken <- round(end.time - start.time, 2)
time.taken

I have been slowly adding samples with more and more sequences to see at what point the processing time becomes prohibitive. With sequences this long, if we do not set DETECT_SINGLETONS=TRUE, almost no sequences are returned per sample (mostly just one).
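The sample-at-a-time loop from the Big Data workflow, adapted to these objects, would look roughly like this (a sketch rather than the exact code I ran; it reuses err1 and the dada() settings above, and the timing print is only for tracking):

# Process one sample at a time (Big Data style) and time each sample.
dd.list <- vector("list", length(filts1))
names(dd.list) <- basename(filts1)
for (i in seq_along(filts1)) {
  t0 <- Sys.time()
  drp <- derepFastq(filts1[[i]])
  dd.list[[i]] <- dada(drp, err=err1, BAND_SIZE=32, multithread=TRUE,
                       OMEGA_A=1e-10, DETECT_SINGLETONS=TRUE)
  cat(names(dd.list)[i], "took",
      round(as.numeric(difftime(Sys.time(), t0, units="mins")), 1), "minutes\n")
}
seqtab1 <- makeSequenceTable(dd.list)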

cbind(ccs=prim1[,1], primers=prim1[,2], filtered=track1[,2],
      derep=sapply(derep, function(x) sum(x$uniques)),
      denoised=sapply(dd1, function(x) sum(x$denoised)))

The following summary is for 6 samples, which took nearly 3 hours to run:

                                                              ccs primers filtered derep denoised
m84046_231104_180507_s1.bc2001--bc2012.hifi_reads.fastq.gz   5419    3611     3047  3047     3046
m84046_231104_180507_s1.bc2001--bc2019.hifi_reads.fastq.gz    747     587      295   295      295
m84046_231104_180507_s1.bc2002--bc2012.hifi_reads.fastq.gz   5034    3500     2982  2982     2982
m84046_231104_180507_s1.bc2002--bc2020.hifi_reads.fastq.gz  11716    8232     6756  6756     6755
m84046_231104_180507_s1.bc2004--bc2012.hifi_reads.fastq.gz    621     448      351   351      351
m84046_231104_180507_s1.bc2009--bc2013.hifi_reads.fastq.gz  20774   14707    12546 12546    12544
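The same per-sample counts can be expressed as fractions of the raw CCS reads to make the losses at each step easier to compare (a small sketch reusing the cbind() columns above; not part of the original run):

# Per-step read retention as a fraction of raw CCS reads, per sample.
summ <- cbind(ccs=prim1[,1], primers=prim1[,2], filtered=track1[,2],
              denoised=sapply(dd1, function(x) sum(x$denoised)))
round(sweep(summ, 1, summ[, "ccs"], "/"), 2)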

I then added another sample with around 21,000 raw CCS reads, and it has been running for almost 24 hours. We have a total of 99 samples with an average of 64,000 reads per sample.

I am using a Mac Studio with 192 GB of memory and an Apple M2 Ultra chip (24 cores: 16 performance and 8 efficiency). The process is currently using 10-20 GB of memory.

Is there anything I could do differently, or that I may be doing wrong? I have been thinking about running it on a supercomputer or similar, but at this stage I am not sure what I would have to request to speed up processing. We were also considering using QIIME instead, with 99% clustering to account for PCR errors, but I am worried that the clustering step may also take a very long time.

Any advice you may have would be greatly appreciated. Cheers, Sabrina

benjjneb commented 8 months ago

Hi Sabrina,

Depending on the level of replication, it may not be appropriate to use DADA2 to analyze your data. Can you check the output of derepFastq(..., verbose=TRUE) on a couple of example samples? Is the number of unique sequences significantly less than the number of reads?

HeiserS commented 8 months ago

Thank you for your response. We do not have much replication at all, which is why I went with DETECT_SINGLETONS=TRUE. Maybe I misunderstood that it was a workaround for long reads?

derep <- derepFastq(filts1, verbose=TRUE)

Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2001--bc2012.hifi_reads.fastq.gz
Encountered 3045 unique sequences from 3047 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2001--bc2019.hifi_reads.fastq.gz
Encountered 295 unique sequences from 295 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2002--bc2012.hifi_reads.fastq.gz
Encountered 2981 unique sequences from 2982 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2002--bc2020.hifi_reads.fastq.gz
Encountered 6756 unique sequences from 6756 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2004--bc2012.hifi_reads.fastq.gz
Encountered 351 unique sequences from 351 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2008--bc2011.hifi_reads.fastq.gz
Encountered 25464 unique sequences from 25465 total sequences read.
Dereplicating sequence entries in Fastq file: /Users/sabrinaheiser/Documents/fastx_files/noprimers/filtered/m84046_231104_180507_s1.bc2009--bc2013.hifi_reads.fastq.gz
Encountered 12546 unique sequences from 12546 total sequences read.

benjjneb commented 8 months ago

So this data just isn't workable with the DADA2 approach.

DADA2 relies on repeated observations of a sequence in order to seed new ASVs. DETECT_SINGLETONS=TRUE relaxes this, but when there is virtually no duplication the DADA2 error model will not be learned correctly (see your plots; that is not what PacBio error profiles look like), and even that setting won't be effective.

We'd like to work on this more (funding permitting) in the future, but right now a rough rubric is: if duplication is less than 10% (so #(unique sequences) > 0.9 * #(reads)), then DADA2 isn't appropriate.
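Applied to the derep object from earlier in this thread, that rubric would look roughly like this (a sketch, not an official DADA2 diagnostic):

# For each sample: TRUE means the number of unique sequences exceeds 90% of
# the read count, i.e. duplication is under 10% and DADA2 isn't appropriate.
sapply(derep, function(x) length(x$uniques) > 0.9 * sum(x$uniques))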

HeiserS commented 8 months ago

Thank you, that makes sense!