benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 denoise of PacBio sequences produced Warning: NAs produced by integer overflow #1839

Open emankhalaf opened 10 months ago

emankhalaf commented 10 months ago

Hi @benjjneb

I am processing over 1000 16S PacBio SMRT-sequenced samples using the DADA2 workflow in R, and I got this warning message from the dada step using pool="pseudo":

Warning: NAs produced by integer overflow
Warning: NAs produced by integer overflow
Warning: NAs produced by integer overflow
selfConsist step 2
Warning: NAs produced by integer overflow
Warning: NAs produced by integer overflow
Warning: NAs produced by integer overflow

However, the denoising step finished successfully. So my question is: should I do anything about this warning, or can I just proceed with the workflow? I think this warning concerns R's capacity to denoise such a large number of samples, correct?
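For context, "NAs produced by integer overflow" is the warning base R raises when 32-bit integer arithmetic exceeds .Machine$integer.max. The following is a minimal base R sketch of that behaviour only. It does not use dada2, and it makes no claim about where inside dada2 the overflow occurs, which this thread has not established.

```r
# R integers are 32-bit signed; the largest representable value is
# .Machine$integer.max (2147483647).
big <- .Machine$integer.max

# Integer arithmetic past that limit returns NA and raises a warning.
# The handler prints the warning text and then suppresses the warning.
overflowed <- withCallingHandlers(
  big + 1L,
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
is.na(overflowed)  # TRUE: the result is NA, not a wrapped-around number

# The same arithmetic on doubles does not overflow.
as_double <- as.numeric(big) + 1
as_double > big    # TRUE
```

The point of the sketch is that an overflowed value silently becomes NA, so a computation can run to completion while carrying missing values forward.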

Thanks in advance!

Eman

benjjneb commented 10 months ago

This is a new one for me. Can you tell me more?

This warning did not come up in learnErrors? What was the exact dada call that produced the warning? What is the size of the input data? E.g. reads/lengths.

emankhalaf commented 10 months ago

@benjjneb

Thank you for your reply.

This warning did not come up in learnErrors? No.

What was the exact dada call that produced the warning? Please see below:

dd <- dada(filts, err=err, pool="pseudo", BAND_SIZE=32, OMEGA_A=1e-10, DETECT_SINGLETONS=FALSE, multithread=TRUE)

What is the size of the input data? E.g. reads/lengths.

50 GB, approx. 1020 samples. I am following the DADA2 tutorial for PacBio reads. Here is the code I used:

filts <- file.path("home/Documents/sequences", "noprimers", "filtered", basename(fns))
track <- filterAndTrim(nops, filts, minQ=3, minLen=1300, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)

Now, I exported the feature table, tax table, sequences, alignment, and the tree as rds, and I exported the phyloseq object as well. The sequences used to construct the tree are 6823 sequences.

Are there any concerns regarding the warning that I posted above from dada step?

I hope I have described the issue clearly. Thanks!

benjjneb commented 10 months ago

Now, I exported the feature table, tax table, sequences, alignment, and the tree as rds, and I exported the phyloseq object as well. The sequences used to construct the tree are 6823 sequences.

So after the warning, you still are getting what looks like valid output?

Are there any concerns regarding the warning that I posted above from dada step?

I am a bit concerned, as integer overflows on the R side (which I'm guessing is what is triggering this warning) could lead to incorrect output. What is the size of the sequence table you are getting after denoising here?
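Because an overflowed integer becomes NA, one quick sanity check is to look for NA entries in the count matrix. The sketch below uses a toy matrix as a stand-in for the sequence table; seqtab_toy and its contents are hypothetical, not data from this thread. Finding no NAs would not prove the output is correct, since an overflow could also affect intermediate values that never reach the table.

```r
# Toy stand-in for a sample-by-sequence count matrix (hypothetical data,
# with one NA planted to illustrate the check).
seqtab_toy <- matrix(
  c(10L, 0L, 5L,
     3L, NA, 7L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sampleA", "sampleB"), c("ACGT", "ACGA", "ACGG"))
)

# Are there any NA counts at all?
has_na <- anyNA(seqtab_toy)

# Locate the affected samples (rows) and sequences (columns).
na_pos <- which(is.na(seqtab_toy), arr.ind = TRUE)
affected_samples <- unname(rownames(seqtab_toy)[na_pos[, "row"]])
affected_seqs <- unname(colnames(seqtab_toy)[na_pos[, "col"]])

# Convert to double before summing so the total itself cannot overflow.
total_reads <- sum(as.numeric(seqtab_toy), na.rm = TRUE)
```

On a real sequence table the same calls would apply to the matrix directly, for example anyNA(seqtab).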

emankhalaf commented 10 months ago

@benjjneb 23.8 MB is the size of the sequence table output from the denoising step. Is it reasonable for this step?

benjjneb commented 10 months ago

By size, I mean dim(seqtab). And maybe also summary(nchar(getSequences(seqtab)))

emankhalaf commented 10 months ago

@benjjneb

dim(seqtab) is 1019 6907 (1019 samples, 6907 sequences)

summary(nchar(getSequences(seqtab)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1302    1412    1437    1434    1462    1596

benjjneb commented 9 months ago

Hmm, that doesn't seem to be a problematic size.

One guess is that there could be an overflow happening in the pseudo-pooling code. Would you be able to test that by running the dada(...) command again that produced the warning, but with pool=FALSE, to see if the warning goes away?
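To make that comparison concrete, the warnings from each run can be captured rather than read off the console. The sketch below defines a hypothetical base R helper, collect_warnings (not part of dada2), and demonstrates it on a deliberate overflow. The dada calls appear only as comments, reusing the arguments posted earlier in the thread, and are not run here.

```r
# Hypothetical helper: evaluate an expression and return both its value
# and the text of every warning raised along the way.
collect_warnings <- function(expr) {
  msgs <- character(0)
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))  # record the warning text
      invokeRestart("muffleWarning")         # continue without printing it
    }
  )
  list(value = value, warnings = msgs)
}

# Toy demonstration with a deliberate integer overflow (no dada2 needed).
res <- collect_warnings(.Machine$integer.max + 1L)
res$warnings

# With dada2 loaded and filts/err defined as earlier in the thread, the
# two runs could be compared like this (assumed usage, not run here):
# res_pseudo <- collect_warnings(
#   dada(filts, err=err, pool="pseudo", BAND_SIZE=32, OMEGA_A=1e-10,
#        DETECT_SINGLETONS=FALSE, multithread=TRUE))
# res_indep <- collect_warnings(
#   dada(filts, err=err, pool=FALSE, BAND_SIZE=32, OMEGA_A=1e-10,
#        DETECT_SINGLETONS=FALSE, multithread=TRUE))
# table(res_pseudo$warnings)
# table(res_indep$warnings)
```

If the overflow warning shows up only in the pool="pseudo" run, that would support the guess about the pseudo-pooling code; if it shows up in both, the cause lies elsewhere.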