benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Issue with learnErrors method #1400

Closed Rdiazrua closed 5 months ago

Rdiazrua commented 3 years ago

Hello,

I have one MiSeq run (2 x 300 bp) in which I spent all the sequencing effort on just one sample (V3-V4 16S amplicon sequencing). After removing the primers using cutadapt, the data amount to 6.5 GB for each of the F and R files.

After filtering and trimming, I ran learnErrors with default options: errF <- learnErrors(filtFs, multithread=TRUE)

The console shows the following:

2358584750 total bases in 9434339 reads from 1 sample will be used for learning the error rates.

It has been running for at least 2 days on a Windows machine (Intel(R) Xeon(R) Gold 6230R CPU @ 2.1 GHz, 256 GB RAM, 52 cores) without finishing.

When I run it on a couple of similar samples, but with much less data (around 90 MB F and R files), it works fine in less than 5 minutes.

53734750 total bases in 214939 reads from 2 samples will be used for learning the error rates (attached plot)

Has anyone experienced a similar problem? Can you advise me on the cause or how to solve it?

Sorry if something is missing; I am still learning.

Appreciate any help on that.

Best, Rubén

Rplot.pdf

benjjneb commented 3 years ago

The complexity of DADA2 roughly scales as the number of unique sequences in a sample, squared. Thus, when you have a very high-depth sample, the complexity eventually grows as depth^2 (new error sequences continue to accumulate in proportion to depth even after all real variants have been sampled).

learnErrors restricts itself to using only the first 1e8 bases (or ~1M 100bp reads) for error-rate inference, but it takes each sample as an indivisible chunk, so if the first sample is much larger than that, as is the case here, it still uses the whole sample.
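(That cap corresponds to the nbases argument of learnErrors, whose default is 1e8. Lowering nbases would not help here, since samples are only added in whole-sample increments; something like the call below still pulls in the entire first sample.)

# nbases caps the total data used for error learning, but whole samples are
# added until the cap is reached, so a single 2.4B-base sample is used in full.
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)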

The result of all that is that learnErrors becomes quite computationally expensive when run on a sample with 2B bases (10M reads).

But that computational complexity isn't really needed; you can get a good estimate of the error rates from a subsample of the data. learnErrors does not subsample within samples, but you can do that manually, and it will improve computational time substantially. I might try subsampling this to 5% of the reads, and then running learnErrors on that.
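A minimal sketch of that manual subsampling, using ShortRead's FastqSampler and writing the subsample back to disk so that learnErrors still gets fastq file paths rather than in-memory objects (file names below are placeholders):

library(ShortRead)

# Reservoir-sample n reads from a fastq(.gz) file and write them to a new file.
subsample_fastq <- function(fq.in, fq.out, n) {
  sampler <- FastqSampler(fq.in, n = n)
  writeFastq(yield(sampler), fq.out, compress = TRUE)
  close(sampler)
}

set.seed(100)  # reproducible subsample
# ~5% of ~9.4M reads is roughly 470k reads per file.
subsample_fastq("sample1_F_filt.fastq.gz", "sample1_F_sub.fastq.gz", n = 470000)
subsample_fastq("sample1_R_filt.fastq.gz", "sample1_R_sub.fastq.gz", n = 470000)

# Learn the error rates from the on-disk subsamples, then use errF/errR
# with the full files in the dada() step.
errF <- learnErrors("sample1_F_sub.fastq.gz", multithread = TRUE)
errR <- learnErrors("sample1_R_sub.fastq.gz", multithread = TRUE)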

Rdiazrua commented 3 years ago

Hi @benjjneb,

Thank you very much for your prompt reply, and sorry for the delay; I wanted to test it before answering so as not to waste your time.

The subsampling worked for the learnErrors step, but I think I am stuck again. Let me elaborate.

Initially, I tried to perform the subsampling inside RStudio using FastqSampler, but dada2 detected that it was an in-memory R object and gave me some errors. In the end, I manually created a new file with only 5% of the reads outside RStudio. I ran dada2 up to the learnErrors step with this subset and kept the errF and errR objects in the environment (attached plots). After that, I started the pipeline again with the full files, skipping the learnErrors step for them. It worked (almost 3 days running), but I got some different messages and output in the console. Here are the details:

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

...F......Encountered 1686068 unique sequences from 9434339 total sequences read.

...R......Encountered 1236514 unique sequences from 9434339 total sequences read.

names(derepFs) <- sample.names
names(derepRs) <- sample.names

After running the above step, the console showed something weird, as if it were not possible to rename the object.

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)

Sample 1 - 9434339 reads in 1686068 unique sequences.

dadaRs <- dada(derepRs, err=errR, multithread=TRUE) #WAIT

Sample 1 - 9434339 reads in 1236514 unique sequences

dadaFs[[1]]

dadaRs[[1]]

These two steps gave me a different output than the one described in the dada2 manual.

Results that I usually get with smaller samples (An example):

dada-class: object describing DADA2 denoising results

716 sequence variants were inferred from 52826 input unique sequences.

Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

This time, I got a series of sequences, each followed by a number (I paste some of them below), without the above details.

CGTTCGCTCCCCTAGCTTTCGTGCCTCAGCGTCAGAAAAGACCCAGTGAGCTGCTTTCGCCTCCGGTGTTCCTAATGATATCAACGCATTTCACCGCTCCACCATTAGTTCCGCTCACCTCTATCTCCCTCAAGCCATGTAGTTTTGAGCGCAGTTCCTCGGTTGAGCCGAGGGATTTCACACCCAACTTACAAAGCCGC 430
TGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTATCGGGCCAGTGAGCCGCCTTCGCTTCTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCCCGAACTCCAGATTGCCAGTATCAAAGGCAGTTCCGAGGTTGAGCCCCGGCATTTCACCTCTGACTTAACAATCCGC 345
TGTTCGCTACCCACGCTTTCGTTCCTCAGCGTCAGTTACTGCCCAGAGACCCGCCTTCGCCACCGGTGTTCCTCCTGATATCTACGAATTTCACCTCTACACCGGGAATTCCACCCCCCTCTCCGAGACTCAAGCTAAGCAGTATCCGATGCACGTCCTCAGTTGAGCCGAGGGATTTCACATCGGACTTACCTAGCCGC

Then I continued: mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

8895869 paired-reads (in 179003 unique pairings) successfully merged out of 9323490 (in 449126 pairings) input.

head(mergers[[1]])

[1] "TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACCTGCAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGTTCGTAGGCGGTTTGTCGCGTCGTTTGTGAAAACCAGCAGCTCAACTGCTGGCTTGCAGGCGATACGGGCAGACTTGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAACGAAAGCGTGGGTAGCGAACA" [2] "TGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATGCCGCGTGGAGGTAGAAGGCCCACGGGTCGTGAACTTCTTTTCTCGGAGAAGAAGCAATGACGGTATCTGAGGAATAAGCATCGGCTAACTCTGTGCCAGCAGCCGCGGTAAGACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATG"
[3] "TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCGCTAGAGAAGATAATGACGGTATCTAGTAAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATATTTAAGTCAGGGGTGAAATCCCGGAGCTCAACTCCGGAACTGCCTTTGATACTGGGTATCTAGAGTATGGAAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGTCCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA"

seqtab <- makeSequenceTable(mergers)

dim(seqtab)

[1] 1 179003

table(nchar(getSequences(seqtab)))

   258   259   260   261   290   314   363   368   376   377   399   400   401   402   403   404
     1     2    70     9     3     1     1     1     3     5     1    85  1244 70828 10404  3224
   405   406   407   408   409   410   411   412   413   414   415   416   417   418   419   420
  1969  1259 45686  1567    77     9    64    50    20    21    93    13    29    16    68   134
   421   422   423   424   425   426   427   428   429   430   431
    75    77    63  1747    56   620 16603 22384   408     9     4

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

Now I am stuck removing chimeras (it has been running for almost two days). I am not sure if I am carrying over a mistake or if this is a normal time for this type and amount of data.

Do you think the size of the seqtab makes sense?

My main worry/question is about the time needed to process one sample comprising a full MiSeq run, because I have several similar experiments. Moreover, in the future, the idea is to run NovaSeq 2 x 250 to get ultra-deep sequencing samples.

Thank you very much, and again sorry if something basic is missing; I am still learning.

Please do not hesitate to ask me for any further information.

All the best, Ruben

error_F_subsample.pdf error_R_subsample.pdf

benjjneb commented 3 years ago

On your first question, about the different output when looking at dadaFs[[1]]: this is because when just a single sample is processed by the dada function it returns a dada-class object (implemented under the hood as a structured R list), while when multiple samples are processed by the dada function it returns a list of dada-class objects. This behavior is common in R, i.e. "simplifying" a list of objects to the bare object instead of a length-1 list of objects. But it is confusing.

So, when you do dadaFs[[1]] with your single sample, you are looking at the first element inside the dada-class object, while when you do the same after processing multiple samples, you see the print output for the first dada-class object. To get what you expect here, just dadaFs will suffice.
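A quick way to check which case you are in (an illustrative snippet, using your existing dadaFs object):

# With a single sample, dada() returns a dada-class object directly:
class(dadaFs)   # "dada"
dadaFs          # prints the "### sequence variants were inferred ..." summary

# With multiple samples it returns a list of dada-class objects instead:
# class(dadaFs) # "list"
# dadaFs[[1]]   # then this prints the summary for the first sample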

On the second question, on whether the overall numbers and run-times are within expectations: yes, they probably are. A large fraction of those 179k ASVs are likely chimeric, and the run-time for chimera removal can be substantial in this sort of case. More cores always help, but another option is to restrict the ASVs being considered as chimeric "parents" with the minParentAbundance parameter in removeBimeraDenovo, as in the call below. A setting of e.g. 8 should still be effective.
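Concretely, something like this (minParentAbundance is passed through removeBimeraDenovo to the underlying bimera detection; 8 is the value suggested above):

# Only ASVs with abundance >= 8 are considered as potential chimera "parents",
# which greatly reduces the number of parent pairs that must be tested.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus",
                                    minParentAbundance=8,
                                    multithread=TRUE, verbose=TRUE)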

My main worry/question is about the time needed to process one sample comprising a full MiSeq run, because I have several similar experiments. Moreover, in the future, the idea is to run NovaSeq 2 x 250 to get ultra-deep sequencing samples.

Admittedly, we have not done as much as we could to optimize the computational complexity of dada on super-deep samples. A MiSeq-run-sized sample is definitely tractable (as you see), but run-time will increase roughly quadratically with the number of additional reads in a single sample. Stricter quality filtering will help, but the profusion of chimeras often starts to dominate the unique sequences in a sample as depth increases past a certain point, and that isn't removed by quality filtering.
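As an illustration of stricter quality filtering (object names follow the tutorial convention, and the truncation lengths and maxEE values below are placeholders rather than recommendations for your amplicon), tightening filterAndTrim reduces the pool of error-containing unique sequences that dada has to resolve:

# Illustrative values only; pick truncLen from your quality profiles so the
# reads still overlap enough to merge for V3-V4.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(270, 200), maxEE=c(2, 2), truncQ=2, maxN=0,
                     rm.phix=TRUE, compress=TRUE, multithread=TRUE)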

Rdiazrua commented 3 years ago

Hi @benjjneb,

Thank you very much for your time.

I have decided to keep the removeBimeraDenovo run going until the weekend (5 days in total). If I do not see anything by then, I will stop it and run it again with your minParentAbundance suggestion, first on the subset (5% of all the reads) and then on the whole sample.

I will keep you in the loop.

Thanks again.

Best, Ruben

Rdiazrua commented 3 years ago

Hi @benjjneb,

It worked when applying minParentAbundance=8!!!! I could also assign taxonomy. Now I am running 3 of these ultra-deep samples separately. Then I will merge them to create a single table.
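For the merging step, my plan is something along these lines (placeholder object names; each seqtab comes from makeSequenceTable on one ultra-deep sample):

# Merge the per-sample sequence tables into one table, then remove chimeras
# once on the merged table with the minParentAbundance setting that worked.
seqtab.all <- mergeSequenceTables(seqtab1, seqtab2, seqtab3)
seqtab.all.nochim <- removeBimeraDenovo(seqtab.all, method="consensus",
                                        minParentAbundance=8,
                                        multithread=TRUE, verbose=TRUE)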

Could we keep this open for a couple of weeks? I would like to let you know if I run into more difficulties.

Really appreciate your valuable help.

Thank you very much.

All the best,