benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

LearnErrors just hangs #847

Closed hrivera28 closed 4 years ago

hrivera28 commented 4 years ago

Hello!

I'm trying to process a 16S amplicon sequencing dataset (250 bp, paired end reads, sequenced on a MiSeq2000).

When I try to run learnErrors, though, the function just doesn't finish. I'm running it on an iMac desktop with 8 GB of RAM.

R version: 3.6.1
dada2 version: 1.12.1
Rcpp version: 1.0.2
ShortRead version: 1.42.0

My sequence qualities are really good (graphs of a few samples attached). I've read through all the learnErrors threads I could find on here and none of the solutions seem to quite fit what's happening on my end.

I've already filtered out reads that don't have correct primers and then trimmed off primers and adaptors plus some end trimming on both forward and reverse.

This is my filterAndTrim command:

out <- filterAndTrim(fnFs, filtFs, rev=fnRs, filt.rev=filtRs,
                     truncLen=c(245,235),
                     maxN=0,                 # DADA2 does not allow Ns
                     maxEE=c(3,3), truncQ=2, trimLeft=23,
                     rm.phix=TRUE,           # remove reads matching phiX genome
                     compress=TRUE, multithread=TRUE)

I've also upped the maxEE from 1 to 3 (shown here) and that hasn't helped either.

If I run derepFastq these are my levels of unique sequences:

Encountered 16370 unique sequences from 18244 total sequences read.
Encountered 22378 unique sequences from 27513 total sequences read.
Encountered 58342 unique sequences from 124661 total sequences read.
Encountered 48698 unique sequences from 120389 total sequences read.
Encountered 37699 unique sequences from 88624 total sequences read.
Encountered 522208 unique sequences from 1544645 total sequences read.
Encountered 31756 unique sequences from 69208 total sequences read.
Encountered 68921 unique sequences from 173209 total sequences read.
Encountered 54133 unique sequences from 134507 total sequences read.
Encountered 132451 unique sequences from 362523 total sequences read.
Encountered 121724 unique sequences from 333999 total sequences read.
Encountered 83667 unique sequences from 228580 total sequences read.
Encountered 75078 unique sequences from 200021 total sequences read.
Encountered 93557 unique sequences from 260807 total sequences read.
Encountered 47432 unique sequences from 132093 total sequences read.
Encountered 57233 unique sequences from 173633 total sequences read.
Encountered 78514 unique sequences from 224353 total sequences read.
Encountered 47667 unique sequences from 122320 total sequences read.
Encountered 27713 unique sequences from 63978 total sequences read.
Encountered 67253 unique sequences from 183778 total sequences read.
Encountered 112961 unique sequences from 321857 total sequences read.
Encountered 36715 unique sequences from 108725 total sequences read.
Encountered 67437 unique sequences from 181394 total sequences read.
Encountered 56632 unique sequences from 134078 total sequences read.
Encountered 44295 unique sequences from 164577 total sequences read.
Encountered 101958 unique sequences from 285904 total sequences read.
Encountered 22367 unique sequences from 49611 total sequences read.
Encountered 34534 unique sequences from 79680 total sequences read.
Encountered 53579 unique sequences from 148981 total sequences read.
Encountered 67999 unique sequences from 180582 total sequences read.
Encountered 82171 unique sequences from 233690 total sequences read.

So they're a bit high (on average about 42% of reads are unique), but the total number of seqs isn't that high(?). Even when I only try to use samples with fewer unique reads, learnErrors stalls.
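(For reference, a minimal sketch of how those fractions can be tallied; filtFs here is assumed to be the vector of filtered forward-read files:)

drps <- derepFastq(filtFs, verbose=TRUE)   # one derep object per file
frac_unique <- sapply(drps, function(d) length(d$uniques) / sum(d$uniques))
mean(frac_unique)                          # ~0.42 as reported above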

I've also tried running this on a different system and get the same issues so it's not a package/installation problem presumably.

Any advice?

[Attached: quality profile plots for a few samples]
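(Presumably the attached profiles were produced along these lines, with fnFs/fnRs being the raw forward and reverse fastq files:)

plotQualityProfile(fnFs[1:4])
plotQualityProfile(fnRs[1:4])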

hrivera28 commented 4 years ago

Oh, and I forgot to add that the only output I get in the R window is the line stating that it's using X reads across 4 samples, and then nothing else. It never makes it to any of the later progress output lines.

benjjneb commented 4 years ago

That looks pretty normal. Nothing suggests that you should be running into any particular issues with learnErrors.

The one thing I notice is that you are a bit memory constrained. That can lead to dramatic slowdowns in commands if swapping of memory becomes required. I'd suggest trying the following:

Run the newest tutorial, in which you don't have a derepFastq step, but instead just run learnErrors on the filtered files directly. This will only load into memory the small number of samples needed for learning the errors, which should comfortably fit into memory. If memory is the issue, this should complete much faster.
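As a sketch (assuming filtFs is the character vector of filtered forward-read files written by filterAndTrim):

errF <- learnErrors(filtFs, multithread=TRUE)   # only loads the samples needed to reach the default nbases
plotErrors(errF, nominalQ=TRUE)                 # sanity-check the fitted error rates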

ps: When you say it "doesn't finish", how long did you give it?

hrivera28 commented 4 years ago

Hi Benjamin,

Thanks for the quick response! So I've tried both giving it just the list of filtered files and giving it a derep object; both stall. The longest I gave it was about 30 or so hours? Left it overnight and checked the next day.

Thanks again! Hanny


hrivera28 commented 4 years ago

I also BLASTed a small subset of sequences and they match what I expect: mostly 16S, with some coral mitochondrial seqs (this is a coral microbiome project). So it seems like the sequences should be okay as well, and I haven't just sequenced random things...


benjjneb commented 4 years ago

Do you have access to another machine that you could try to run the data on? Nothing else sticks out to me, so that's probably the next thing I would try just to rule out any funky installation/machine-specific issues.

hrivera28 commented 4 years ago

Hi Benjamin,

Yea I tried on two laptops as well... haven’t tried on a cluster yet though.

Hanny


benjjneb commented 4 years ago

If you could try on a cluster I would. I'd also try running it with just one sample on your laptop to see if that still hangs.
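(A single-sample test could be as simple as the following, again assuming filtFs holds the filtered forward-read files:)

err1 <- learnErrors(filtFs[1], multithread=TRUE)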

If those don't work, could you share a sample or 5 with me so I can try to reproduce the behavior on my end?

gnanibioinfo commented 4 years ago

Dear Benjjneb,

Like hrivera28, I am also having a similar issue when processing a 16S amplicon sequencing dataset (301 bp, paired-end reads, sequenced on a MiSeq2000).

The learnErrors function looks like it has hung for more than 16 hrs. I am running it on a Windows desktop computer with 8 GB of RAM.

These are the parameters for my filterAndTrim function:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(277,222), trimLeft=c(17,21),
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE

It produced the following output:

                          reads.in reads.out
A-0NA_0_L001_R1_001.fastq   219311    185917
A-0NA_1_L001_R1_001.fastq   221428    186809
A-0NA_2_L001_R1_001.fastq   233437    197783
A-0NB_3_L001_R1_001.fastq   228026    193247
A-0NB_4_L001_R1_001.fastq   220343    189165
A-0NB_5_L001_R1_001.fastq   210740    174047

When I ran the learnErrors function:

errF <- learnErrors(filtFs, multithread=FALSE)

it reported "148332340 total bases in 570509 reads from 3 samples will be used for learning the error rates." but it is still running and seems to have hung.

Any suggestions on how to move forward would be appreciated!

benjjneb commented 4 years ago

@gnanibioinfo In your case it seems more likely that it is just taking a while to run. I'd strongly suggest turning multithreading on for the learnErrors step. It is only the filterAndTrim step where multithreading is not suggested on Windows, the other steps will make use of it just fine.
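A minimal sketch, using the filtFs/filtRs names from the post above:

errF <- learnErrors(filtFs, multithread=TRUE)   # multithreading works for learnErrors on Windows
errR <- learnErrors(filtRs, multithread=TRUE)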

hrivera28 commented 4 years ago

Hi Ben,

Sorry for the delay. I haven't tried it on a cluster yet, but here are a few sample files (3 samples, 6 files: F and R reads). Let me know if you'd like more. These are the versions I'm feeding into learnErrors, so they should have primers and adaptors removed.

DO5_F_filt.fastq.gz DO5_R_filt.fastq.gz M2_F_filt.fastq.gz M2_R_filt.fastq.gz N1_F_filt.fastq.gz N1_R_filt.fastq.gz

benjjneb commented 4 years ago

I ran the following without incident on my machine:

library(dada2); packageVersion("dada2")

[1] ‘1.12.1’

setwd("~/Desktop/hrivera")
filt <- list.files(pattern="F_filt.fastq.gz")
filt

[1] "DO5_F_filt.fastq.gz" "M2_F_filt.fastq.gz" "N1_F_filt.fastq.gz"

err <- learnErrors(filt, multi=TRUE)

Completes normally; it took less than an hour on my 8-thread laptop.

hrivera28 commented 4 years ago

Hi Ben,

Odd! I'm happy it worked on your machine though. I'll give it another shot on my end...

Thanks so much for the help! -Hanny


benjjneb commented 4 years ago

Closing as unreproducible. Feel free to reopen if there is more information though.

wufabai commented 4 years ago

Hi Ben, I am running into exactly the same problem using DADA2 version 1.14.1. I am processing a PacBio dataset. I tried to either:

1) follow the tutorial and do errF <- learnErrors(filt, multithread=TRUE) with derep, or
2) follow the PacBio DADA2 NAR 2019 paper to first derep and then do err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)

In both cases, I have a similar number (~80 million) of bases called for error estimation. This step just hangs there. What is weird is that, instead of overwhelming my laptop, R is actually not using any CPU at all.

I am using a 2017 MacBook Pro with 8 GB of memory on Mojave.

Thank you in advance, Fabai

benjjneb commented 4 years ago

@wufabai I was unable to reproduce this behavior previously, and that would be the key first step for making progress on this.

Can you share a reproducible example? That includes the sequencing data causing this behavior.

wufabai commented 4 years ago

@benjjneb Thanks for responding so quickly. I realized that I just needed to wait longer; it took 2.5 hours for the 80M bases on my laptop.