The complexity of DADA2 roughly scales as the square of the number of unique sequences in a sample. Thus, when you have a very high-depth sample, the complexity eventually goes up as depth^2 (new errors continue to accumulate in proportion to depth even after all real variants have been sampled).
learnErrors restricts itself to using only the first 1e8 bases (~1M 100bp reads) for error-rate inference, but it takes each sample as an indivisible chunk, so if the first sample is much larger than that, as is the case here, it still uses the whole sample.
The result of all that is that learnErrors becomes quite computationally expensive when run on a sample with 2B bases (10M reads).
But that computational cost isn't really needed: you can get a good estimate of the error rates from a subsample of the data. learnErrors does not subsample within samples, but you can do that manually and it will improve computational time substantially. I might try subsampling this to 5% of the reads, and then running learnErrors on that.
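One way to do the subsampling without leaving R is ShortRead's FastqSampler, writing the subsample back out to fastq files so that learnErrors still gets file paths rather than in-memory R objects. A minimal sketch under a few assumptions: the file names are placeholders, 5% is the example fraction from above, and the filtered F and R files contain the same reads in the same order so that setting the same seed before each yield() keeps the pairs in sync.

library(ShortRead)

# Count the reads in the filtered forward file and take ~5% of them
n_total <- countFastq("sampleA_F_filt.fastq.gz")$records   # placeholder path
n_sub   <- ceiling(0.05 * n_total)

# Forward reads: sample n_sub reads and write them back to disk
sampler_F <- FastqSampler("sampleA_F_filt.fastq.gz", n = n_sub)
set.seed(100)                                               # same seed for F and R
writeFastq(yield(sampler_F), "sampleA_F_sub.fastq.gz")
close(sampler_F)

# Reverse reads: same seed so the same read positions are drawn
sampler_R <- FastqSampler("sampleA_R_filt.fastq.gz", n = n_sub)
set.seed(100)
writeFastq(yield(sampler_R), "sampleA_R_sub.fastq.gz")
close(sampler_R)

# Learn the error rates from the subsampled files only
errF <- learnErrors("sampleA_F_sub.fastq.gz", multithread = TRUE)
errR <- learnErrors("sampleA_R_sub.fastq.gz", multithread = TRUE)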
Hi @benjjneb,
thank you very much for your prompt reply, and sorry for the delay; I wanted to test it before answering so as not to waste your time.
The subsample worked for learnErrors, but I think I am stuck again. Let me elaborate.
Initially, I tried to perform the subsampling inside RStudio using FastqSampler, but dada2 detected that it was an R object and gave me some errors. In the end I manually created a new file with only 5% of the reads outside RStudio. I ran dada2 up to the learnErrors step with this subset and kept the errF and errR objects in the environment (attached plots). After that I started the pipeline again with the full files, skipping the learnErrors step for them. It worked (almost 3 days running), but I got some different messages and output in the console. Here are the details:
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
names(derepFs) <- sample.names
names(derepRs) <- sample.names
After running the step above, the console showed something weird, as if it were not possible to rename the file.
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE) #WAIT
dadaFs[[1]]
dadaRs[[1]]
These two steps gave me a different output than the one described in the dada2 manual.
Results that I usually get with smaller samples (an example):
This time I got a series of sequences, each followed by a number (I paste some of them below), without the details above.
CGTTCGCTCCCCTAGCTTTCGTGCCTCAGCGTCAGAAAAGACCCAGTGAGCTGCTTTCGCCTCCGGTGTTCCTAATGATATCAACGCATTTCACCGCTCCACCATTAGTTCCGCTCACCTCTATCTCCCTCAAGCCATGTAGTTTTGAGCGCAGTTCCTCGGTTGAGCCGAGGGATTTCACACCCAACTTACAAAGCCGC 430
TGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTATCGGGCCAGTGAGCCGCCTTCGCTTCTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCCCGAACTCCAGATTGCCAGTATCAAAGGCAGTTCCGAGGTTGAGCCCCGGCATTTCACCTCTGACTTAACAATCCGC 345
TGTTCGCTACCCACGCTTTCGTTCCTCAGCGTCAGTTACTGCCCAGAGACCCGCCTTCGCCACCGGTGTTCCTCCTGATATCTACGAATTTCACCTCTACACCGGGAATTCCACCCCCCTCTCCGAGACTCAAGCTAAGCAGTATCCGATGCACGTCCTCAGTTGAGCCGAGGGATTTCACATCGGACTTACCTAGCCGC
Then I continued:
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
head(mergers[[1]])
[1] "TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACCTGCAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGTTCGTAGGCGGTTTGTCGCGTCGTTTGTGAAAACCAGCAGCTCAACTGCTGGCTTGCAGGCGATACGGGCAGACTTGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAACGAAAGCGTGGGTAGCGAACA"
[2] "TGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATGCCGCGTGGAGGTAGAAGGCCCACGGGTCGTGAACTTCTTTTCTCGGAGAAGAAGCAATGACGGTATCTGAGGAATAAGCATCGGCTAACTCTGTGCCAGCAGCCGCGGTAAGACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATG"
[3] "TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCGCTAGAGAAGATAATGACGGTATCTAGTAAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATATTTAAGTCAGGGGTGAAATCCCGGAGCTCAACTCCGGAACTGCCTTTGATACTGGGTATCTAGAGTATGGAAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGTCCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA"
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
table(nchar(getSequences(seqtab)))
   258   259   260   261   290   314   363   368   376   377   399   400   401   402   403   404
     1     2    70     9     3     1     1     1     3     5     1    85  1244 70828 10404  3224
   405   406   407   408   409   410   411   412   413   414   415   416   417   418   419   420
  1969  1259 45686  1567    77     9    64    50    20    21    93    13    29    16    68   134
   421   422   423   424   425   426   427   428   429   430   431
    75    77    63  1747    56   620 16603 22384   408     9     4
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
Now I am stuck removing chimaeras (almost two days running). I am not sure if I am carrying over a mistake or if this is a normal run time for this type and amount of data.
Do you think that the size of the seqtab makes sense?
My main worry/question is about the time to process one sample from a full MiSeq run, because I have several similar experiments. Moreover, in the future the idea is to run NovaSeq 2 x 250 to have ultra-deep sequenced samples.
Thank you very much, and again sorry if something basic is missing; I am still learning.
Please do not hesitate to ask me for any further information.
All the best, Ruben
error_F_subsample.pdf
error_R_subsample.pdf
On your first question, about the different output when looking at dadaFs[[1]]: this is because when just a single sample is processed by the dada function it returns a dada-class object (implemented under the hood as a structured R list), while when multiple samples are processed by the dada function it returns a list of dada-class objects. This behavior is common in R, i.e. "simplifying" a length-1 list of objects down to the bare object instead of returning the list. But it is confusing.
So, when you do dadaFs[[1]] with your single sample, you are looking at the first element inside the dada-class object, whereas when processing multiple samples the same command shows the print output for the first dada-class object. To get what you expect here, just dadaFs will suffice.
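A quick way to check which case you are in (a minimal sketch; dadaFs is the object name used above):

class(dadaFs)        # "dada" for a single sample, "list" for multiple samples
if (inherits(dadaFs, "dada")) {
  print(dadaFs)      # single sample: the dada-class summary prints directly
} else {
  print(dadaFs[[1]]) # multiple samples: summary of the first sample's dada-class object
}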
On the second question, on whether the overall numbers and run-times are within expectations: yes, they probably are. A large fraction of those 179k ASVs are likely chimeric, and the run-time for chimera removal can be substantial in this sort of case. More cores always help, but another option is to restrict the ASVs being considered as chimeric "parents" with the minParentAbundance parameter in removeBimeraDenovo. A setting of e.g. 8 should still be effective.
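For concreteness, a minimal sketch of that suggestion applied to the seqtab built earlier in this thread (8 is just the example value mentioned above):

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus",
                                    minParentAbundance=8,
                                    multithread=TRUE, verbose=TRUE)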
My main worry/question is about the time to process one sample from a full MiSeq run, because I have several similar experiments. Moreover, in the future the idea is to run NovaSeq 2 x 250 to have ultra-deep sequenced samples.
Admittedly we have not done as much as we could to optimize the computational complexity of dada on super-deep samples. A MiSeq-run-sized sample is definitely tractable (as you see), but run-time will increase something like quadratically with the number of additional reads present in a single sample. Stricter quality filtering will help, but the profusion of chimeras often starts to dominate the unique sequences in the sample as depth increases past a certain point, and that isn't removed by quality filtering.
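As an illustration of what stricter quality filtering could look like for a 2 x 300 V3-V4 run, here is a hedged sketch; the truncation lengths and maxEE values are placeholders that would need to be chosen from the actual quality profiles, while keeping enough overlap for merging:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(270, 200),  # placeholder values, not a recommendation
                     maxEE=c(1, 1),         # stricter than the commonly used c(2, 2)
                     truncQ=2, maxN=0, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE)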
Hi @benjjneb,
Thank you very much for your time.
I have decided to keep the removeBimeraDenovo run going until the weekend (5 days in total). If it has not finished by then, I will stop it and run it again with your minParentAbundance suggestion, first on the subset (5% of all the reads) and then on the whole sample.
I will keep you in the loop.
Thanks again.
Best, Ruben
Hi @benjjneb,
Applying minParentAbundance=8 worked!!!! I could also assign taxonomy. Now I am running 3 of these ultra-deep samples separately. Then I will merge them into a single table.
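For reference, the merging step can be done with mergeSequenceTables; a minimal sketch, assuming the three separately processed samples end up in seqtab1, seqtab2 and seqtab3 (hypothetical object names):

st.all <- mergeSequenceTables(seqtab1, seqtab2, seqtab3)
dim(st.all)   # samples x ASVs in the combined table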
Could we keep this issue open for a couple of weeks? I would like to let you know if I have more difficulties.
Really appreciate your valuable help.
Thank you very much.
All the best,
Hello,
I have one MiSeq run (2 x 300 bp) in which I spent all the sequencing effort on just one sample (V3-V4 16S amplicon sequencing). After removing the primers using cutadapt, the data amount to 6.5 Gb for each of the F and R files.
After filtering and trimming, I ran the learnErrors step with default options:
errF <- learnErrors(filtFs, multithread=TRUE)
The console shows the following:
2358584750 total bases in 9434339 reads from 1 sample will be used for learning the error rates.
It has been running for at least 2 days on a Windows machine (Intel(R) Xeon(R) Gold 6230R CPU @ 2.1 GHz, 256 Gb RAM, 52 cores) without finishing.
When I run it on a couple of similar samples, but with much less data (around 90 Mb F and R files), it works fine in less than 5 minutes.
53734750 total bases in 214939 reads from 2 samples will be used for learning the error rates (attached plot)
Has anyone experienced a similar problem? Can you advise me on the cause or how to solve it?
Sorry if something is missing; I am still learning.
Appreciate any help on that.
Best, Rubén
Rplot.pdf