benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

learnErrors() very time consuming #986

Closed hias90 closed 4 years ago

hias90 commented 4 years ago

Dear all,

I'm working on a metagenomics/metabarcoding project on soil samples: Illumina MiSeq 2×300 bp paired-end reads, amplified with 16S primers. I'm following the DADA2 pipeline; however, the learnErrors() step is very time consuming. I assume it's because of the quality of my raw fastq files and the many overrepresented sequences. The following command takes about 3 to 5 hours to finish on my desktop computer (2 cores at 4 GHz):

errF <- learnErrors(filtFs, multithread = TRUE)

31590960 total bases in 131629 reads from 1 samples will be used for learning the error rates.

My DADA2 filtering step is the following:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(250,180), trimLeft=c(10, 10),
                     maxN = 0, maxEE = c(2,2), truncQ = 2, rm.phix = TRUE,
                     compress = TRUE, multithread = FALSE)
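
A quick arithmetic check (illustrative Python, not part of the DADA2 pipeline) shows the learnErrors() message above is consistent with these filtering settings: 31,590,960 bases over 131,629 reads is exactly 240 bp per read, i.e. truncLen=250 minus trimLeft=10.

```python
# Illustrative sanity check (not DADA2 code): the reported base count
# should match the post-filtering forward-read length.
total_bases = 31_590_960   # from the learnErrors() message
n_reads = 131_629
mean_len = total_bases / n_reads
print(mean_len)            # 240.0 == truncLen (250) - trimLeft (10)
```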

I'm using cutadapt to remove the primers from my raw fastq files, prior to running dada2:

cutadapt -g AGAGTTTGATCMTGGCTCAG -a GACTCGGTMCTAGTTTGAGA -G GWATTACCGCGGCKGCTG -A GTCGKCGGCGCCATTAWG -n 2 -m 20 \
  -o Sample1-16S-V1_V3_CCTCCGGTTG-CATAACACCA_L001_R1_cutadapt.fastq.gz -p Sample1-16S-V1_V3_CCTCCGGTTG-CATAACACCA_L001_R2_cutadapt.fastq.gz \
  Sample1-16S-V1_V3_CCTCCGGTTG-CATAACACCA_L001_R1.fastq.gz Sample1-16S-V1_V3_CCTCCGGTTG-CATAACACCA_L001_R2.fastq.gz

Forward primer: AGAGTTTGATCMTGGCTCAG
Reverse primer: GWATTACCGCGGCKGCTG

Please see attached pictures of the FastQC reports (forward read only, before and after cutadapt) and the error rates for all possible base transitions after learnErrors().

As I said, I'm very confused about the duration of learnErrors() and whether or not I need to tune my raw fastq files.

Thank you very much

[Attachments: FastQC reports (raw and after cutadapt) and the learnErrors() error-rate plots for all base transitions]

benjjneb commented 4 years ago

> As I said, I'm very confused about the duration of learnErrors() and whether or not I need to tune my raw fastq files.

I don't see anything to be concerned about here. learnErrors can often be the most time-consuming step, because it repeatedly performs sample inference on the samples until the error model converges. But you only have to do it once per run, and then you're set. If it's completing in a few hours on your desktop, all is well; that is a normal amount of time for soil samples (which are more complex and thus take longer).

One other note:

trimLeft=c(10, 10)

If you have already removed your primers with cutadapt, I would recommend setting trimLeft=0 (the default). The purpose of trimLeft is really to remove primers; it is unnecessary otherwise, and somewhat undesirable because it makes your ASVs harder to compare directly with others generated from the same primer set.

hias90 commented 4 years ago

Thank you for that answer, it is very much appreciated.

hias90 commented 4 years ago

Hi there,

I would like to follow up on my previous questions. When using truncLen=c(250,180) I was not able to merge the reads. I changed the lengths to truncLen=c(280,220), which is about the maximum I would like to go while staying within Q30.

With that length I was able to merge about half my reads.

When checking for chimeras I got the following result:

sum(seqtab.nochim)/sum(seqtab)
[1] 0.7272347
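
In other words (a back-of-the-envelope check in Python, not DADA2 code), that ratio means roughly 27% of the reads were flagged as chimeric:

```python
# Illustrative arithmetic: fraction of reads removed by chimera filtering.
frac_kept = 0.7272347                 # sum(seqtab.nochim) / sum(seqtab)
frac_chimeric = 1 - frac_kept
print(round(frac_chimeric * 100, 1))  # ~27.3 percent flagged as chimeric
```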

Now I'm unsure again about my settings. From other forum posts and the pipeline tutorial, it seems that losing half the reads at merging and a further ~27% as chimeras is an awful lot. Is that again something that can be expected with soil samples? Or should I further increase my truncLen values and work with lower-quality reads?

Many thanks for your support.

benjjneb commented 4 years ago

What is the biological length of the amplicon you are sequencing?

In order to merge, your truncation lengths need to add up to more than the amplicon length + 12 nt (the default minimum overlap for mergePairs), and there should be some padding on top of that, because there is biological length variation in regions of the 16S gene.

> which is about the maximum I would like to go to stay within Q30.

There's nothing magical about staying within Q30. If you need to increase the truncation lengths to have enough sequence to merge, that is far more important.

hias90 commented 4 years ago

The amplicons are 300 bp long.

So I'm surprised that my first attempt with truncLen=c(250,180) wouldn't merge them properly.
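
Working the merge rule backwards makes the contradiction visible (a hedged back-of-the-envelope in Python, not DADA2 code): the longest amplicon that can still merge is the sum of the truncation lengths minus the 12 nt default minimum overlap. Nothing merging at c(250,180) but about half merging at c(280,220) suggests the real amplicons are mostly longer than ~418 bp and only partly under ~488 bp — far longer than the 300 bp stated, and consistent with a full V1–V3 fragment.

```python
# Illustrative arithmetic: the longest amplicon that can still merge is
# truncLen_F + truncLen_R minus the minimum overlap (mergePairs default: 12 nt).
def max_mergeable(trunc_f, trunc_r, min_overlap=12):
    return trunc_f + trunc_r - min_overlap

print(max_mergeable(250, 180))  # 418 -> nothing merged at these settings
print(max_mergeable(280, 220))  # 488 -> about half the reads merged
```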

angelRLbiowo commented 7 months ago

Hi @hias90, I know it's been a long time, but right now I'm at the same step you were, and I was just looking for information because I've been waiting for 2 hours and I see that I have to do this step for both the forward and reverse reads.

With your shared story, I am relieved that I just have to keep waiting.

err_forward_reads <- learnErrors(filtered_forward_reads)
134932000 total bases in 539728 reads from 3 samples will be used for learning the error rates.

Greetings.