As I said, I'm very confused about the duration of learnErrors() and whether or not I need to tune my raw fastq files.
I don't see anything to be concerned about here. learnErrors() can often be the most time-consuming step, because it involves repeatedly performing sample inference on the samples until the error estimates converge. But you only have to do it once for each run, and then you're set. If it's completing in a few hours on your desktop, all is well, and that is a normal amount of time for soil samples (which are more complex and thus take longer).
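If you ever want to bound the runtime, note that learnErrors() trains on at most nbases total bases (1e8 by default). A minimal sketch, assuming the tutorial's object names (filtFs = filtered forward reads):

# Sketch (my addition, not the poster's call): lowering nbases trades a
# little accuracy for speed on a small machine; 1e8 is already the default.
errF <- learnErrors(filtFs, multithread = TRUE, nbases = 1e8)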
One other note:

trimLeft=c(10, 10)

If you have already removed your primers with cutadapt, I would recommend setting trimLeft=0 (the default). The purpose of trimLeft is really to remove primers; it is unnecessary otherwise, and somewhat undesirable for directly comparing your ASVs with others generated using the same primer set.
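For example, a minimal sketch (object names follow the DADA2 tutorial; truncLen is taken from your post, everything else is a placeholder):

# Sketch: primers were already removed by cutadapt, so trimLeft is simply
# omitted and stays at its default of 0.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(250, 180),
                     multithread = TRUE)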
Thank you for that answer; it is very much appreciated.
Hi there,
I would like to follow up on my previous questions. When using truncLen=c(250,180) I was not able to merge the reads. I changed the lengths to truncLen=c(280,220), which is about as far as I would like to go while staying within Q30. With those lengths I was able to merge about half of my reads.
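For reference, my merge step is the standard tutorial call (a sketch; my object names may differ):

# Sketch of the standard DADA2 merge step (tutorial object names).
# mergePairs() requires at least minOverlap (default 12) overlapping
# bases with at most maxMismatch (default 0) mismatches.
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE)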
When checking for chimeras I got the following result:

sum(seqtab.nochim)/sum(seqtab)
[1] 0.7272347
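(For completeness, the chimera check follows the standard tutorial step; a sketch, my exact arguments may differ:)

# Sketch of the chimera-removal step from the DADA2 tutorial:
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)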
Now I'm unsure again about my settings. From reading other forum posts and the pipeline tutorial, it seems that losing half the reads at merging and a further ~27% as chimeras is an awful lot. Is that again something that can be expected with soil samples? Or should I further increase my truncLen and work with lower-quality reads?
Many thanks for your support.
What is the biological length of the amplicon you are sequencing?
In order to merge, your truncation lengths need to add up to more than the amplicon length + 12 (the default minimum overlap), and there should be some padding on top of that, because there is biological length variation in regions of the 16S gene.
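As a quick sketch with your first truncation lengths (assuming mergePairs()'s defaults of minOverlap = 12 and maxMismatch = 0):

# Sketch: the longest amplicon that can still merge after truncation.
trunc <- c(250, 180)            # your first attempt
sum(trunc) - 12                 # 418: amplicons longer than this cannot merge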
which is about as far as I would like to go while staying within Q30.
There's nothing magical about staying within Q30. If you need to increase the truncation lengths to have enough sequence to merge, that is far more important.
The amplicons are 300 bp long.
So I'm surprised that my first attempt with truncLen=c(250,180) didn't merge them properly.
Hi @hias90, I know it's been a long time, but right now I'm at the same step you were at, and I was just looking for information, because I've been waiting for 2 hours and I see that I have to do this step for both the forward and reverse reads. After reading your thread, I am relieved that I just have to keep waiting.

err_forward_reads <- learnErrors(filtered_forward_reads)
134932000 total bases in 539728 reads from 3 samples will be used to learn the error rates.
Greetings.
Dear all,
I'm working on a metagenomics/metabarcoding project with soil samples. It involves Illumina MiSeq 300 bp paired-end sequences and 16S primers. I'm following the DADA2 pipeline; however, the learnErrors() step is very time-consuming. I assume it's because of the quality of my raw fastq files and many overrepresented sequences. It takes about 3 to 5 hours to finish the following command. At the moment I'm working on my desktop computer with 2 cores at 4 GHz.
errF <- learnErrors(filtFs, multithread = TRUE)
My DADA2 filtering step is the following:
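Roughly like this, as a sketch (trimLeft and truncLen are the values I used; the object names and remaining arguments are placeholders from the tutorial):

# Sketch of my filterAndTrim() call; only trimLeft and truncLen are
# the actual values, the rest are tutorial-style placeholders.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft = c(10, 10),
                     truncLen = c(250, 180),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     multithread = TRUE)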
I'm using cutadapt to remove the primers from my raw fastq files, prior to running dada2:
Forward primer: AGAGTTTGATCMTGGCTCAG
Reverse primer: GWATTACCGCGGCKGCTG
Please see the attached pictures of the FastQC reports (forward read only, before and after cutadapt) and of the error rates for all possible base transitions after learnErrors().
As I said, I'm very confused about the duration of learnErrors() and whether or not I need to tune my raw fastq files.
Thank you very much