D-gallinson opened this issue 1 year ago
I agree that these diagnostics, the very flat error rates with quality and the relatively large (>10%) fraction of reads being lost at the denoising step, indicate something is not going right at that step. I'm not sure what it could be though. One thing I would check myself is whether the primers in the raw reads are always at the start of the reads as expected, with no variation in starting position. This can be seen by just looking at a fastq file.
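To tabulate that across a whole file, something like the following sketch would work (a rough sketch with ShortRead/Biostrings; the filename here is hypothetical and the primer/path would need to be adjusted to the actual data):

```r
library(ShortRead)    # readFastq(), sread()
library(Biostrings)   # vmatchPattern(), startIndex()

fwd_primer <- "CCTACGGGNGGCWGCAG"      # forward primer from this thread
fq <- readFastq("B1_R1.fastq.gz")      # hypothetical raw (pre-cutadapt) R1 file

# fixed = FALSE lets the IUPAC ambiguity codes in the primer (N, W) match any base
hits <- vmatchPattern(fwd_primer, sread(fq), fixed = FALSE)

# Start position of the first full-length primer hit in each read (NA = no hit)
first_start <- vapply(startIndex(hits),
                      function(x) if (length(x)) x[[1]] else NA_integer_,
                      integer(1))
table(first_start, useNA = "ifany")
```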
I checked the fastq files, and it looks like there's an inconsistency with the primer sequences themselves. When the primer is intact, it's found at position 1 >99% of the time in all fastq files. However, ~17% of the forward primers and ~20% of the reverse primers are problematic (a deletion or point mutation). For example,
CCTACGGGNGGCWGCAG <- forward primer
-CTACGGGTGGCTGCAG <- fastq seq (deletion pos 1)
ACTACGGGTGGCTGCAG <- fastq seq (C -> A transversion pos 1)
CCTA-GGGTGGCTGCAG <- fastq seq (deletion pos 5)
The cutadapt settings I used permit partial matches and thus removed these sequences, so I'm uncertain if this could be causing my problems. If it's worthwhile, I could try removing reads with primer mutations and re-running DADA2.
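If I do go that route, one rough way to do it in R (a sketch only: hypothetical filename, the reverse reads handled analogously with the reverse primer, and R1/R2 kept in sync) would be to keep only reads whose first bases match the primer exactly, before running cutadapt:

```r
library(ShortRead)    # readFastq(), writeFastq(), sread()
library(Biostrings)   # narrow(), vcountPattern()

fwd_primer <- "CCTACGGGNGGCWGCAG"
fq <- readFastq("B1_R1.fastq.gz")      # hypothetical raw R1 file (primers still present)

# Look only at the first nchar(fwd_primer) bases of each read
front <- narrow(sread(fq), start = 1, end = nchar(fwd_primer))

# Intact primer at position 1; IUPAC codes in the primer (N, W) are still allowed
intact <- vcountPattern(fwd_primer, front, fixed = FALSE) > 0

writeFastq(fq[intact], "B1_R1.intact_primer.fastq.gz")
mean(!intact)   # fraction of reads with a missing/mutated primer (~17% per the counts above)
```

For paired-end data the same `intact` logic would have to be applied jointly to R1 and R2 so the two files stay synchronized.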
Hi @benjjneb and community,
I'm running through the standard DADA2 pipeline but lose >50% of my reads at the merging step. The sequencing setup was as follows:
Here's a count of my reads through each step:
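(For reference, a table like that can be assembled as in the DADA2 tutorial; below is a minimal sketch assuming the standard object names, which may not match my actual script: out from filterAndTrim, dadaFs/dadaRs from dada, and mergers from mergePairs.)

```r
# Read-tracking table across the pipeline; assumes multiple samples, so
# dadaFs, dadaRs, and mergers are lists
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               denoisedF = sapply(dadaFs, getN),
               denoisedR = sapply(dadaRs, getN),
               merged    = sapply(mergers, getN))
colnames(track)[1:2] <- c("input", "filtered")
head(track)
```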
Prior to filtering, I removed primers with cutadapt as follows:
B1.cutadapt.txt (attached) is a representative output, and it indicates that cutadapt successfully removed all forward/reverse primers.
Next, for filtering, I used truncLen = c(228, 220) and maxEE = 2, with rm.phix = T (all else left at default). The amplicon length without primers is ~426 nt, so this truncLen should have yielded ~22 nt of overlap. To my untrained eye, the reads prior to filtering appeared to be of adequate quality, and after filtering the quality seemed quite good (see attached). I did lose a lot of reads here but still had a sufficient amount to proceed (and would prefer to drop potential false positives).
===Pre-filtered reads===
forward.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
===Post-filtered reads===
forward.filtered.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.filtered.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
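In other words, the filterAndTrim call corresponding to these settings looks roughly like this (hypothetical file paths; anything not shown is left at its default):

```r
library(dada2)

# Hypothetical paths: primer-trimmed reads from cutadapt in, filtered reads out
fnFs   <- sort(list.files("cutadapt", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs   <- sort(list.files("cutadapt", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(228, 220),  # 228 + 220 - ~426 nt amplicon = ~22 nt overlap
                     maxEE = c(2, 2),
                     rm.phix = TRUE)
head(out)
```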
Error rates were learned at default settings, and although the model fits seemed sufficient, error frequency often did not decrease with increasing quality, which was unexpected.
forward.quality_vs_error_rates.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.quality_vs_error_rates.no_primers.trunc_228F_220R.maxEE_2.pdf
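(Concretely, this step was just the default calls, sketched here with the filtFs/filtRs names from above; the attached PDFs show the kind of quality-vs-error-rate panels that plotErrors() produces.)

```r
# Error learning at default settings
errF <- learnErrors(filtFs)
errR <- learnErrors(filtRs)

# Diagnostic plots: points = observed error rates, black line = fitted model,
# red line = error rates expected from the nominal quality scores
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
```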
I merged at default settings after denoising (also at default). The distribution of merged read lengths was as follows:
Other than losing the majority of my reads, this distribution didn't seem too unexpected. I also looked at the merge rejects and found almost no mismatches; all failed merges appeared to be due to a lack of overlap (and >95% of the failed merges had only 0-1 nt of overlap between the forward and reverse reads).
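For reference, here is roughly how the merged-length distribution and the reject diagnostics can be pulled out (a sketch assuming the filtFs/filtRs and errF/errR objects from above, multiple samples so mergePairs returns a list, and returnRejects = TRUE to expose the per-pair nmatch/nmismatch columns):

```r
# Denoise (defaults) and merge (defaults)
dadaFs  <- dada(filtFs, err = errF)
dadaRs  <- dada(filtRs, err = errR)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

seqtab <- makeSequenceTable(mergers)
table(nchar(getSequences(seqtab)))      # merged read-length distribution

# Re-run the merge keeping rejected pairings so their overlap stats can be inspected;
# nmatch = overlap length, nmismatch = mismatches within the overlap
with_rejects <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, returnRejects = TRUE)
rejects <- do.call(rbind, lapply(with_rejects, function(df) df[!df$accept, ]))
summary(rejects$nmatch)
summary(rejects$nmismatch)
```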
So I'm not quite sure what's going on, but I suspect the difference in merge success between vsearch and DADA2's mergePairs indicates that something is going wrong during denoising (and is thus causing issues when the denoised reads are merged). Could the failure of error frequency to decrease with increasing read quality be causing a problem with the learned error rates (or indicating that one exists)? I've seen a user suggest (Issue #831) running filterAndTrim, merging with an external tool (e.g., vsearch), and then returning to DADA2 for chimera removal onward, but I'm reluctant to throw out DADA2's denoising step.
Thanks for any help that can be provided!