benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Losing >50% of Reads on Merge #1832

Open D-gallinson opened 1 year ago

D-gallinson commented 1 year ago

Hi @benjjneb and community,

I'm running through the standard DADA2 pipeline but lose >50% of my reads at the merging step. The sequencing setup was as follows:

Here's a count of my reads through each step:

    input filtered denoisedF denoisedR merged nonchim
B1 164622    67269     52627     55123  11836    9627
B2 167349    68605     55257     56973  13057   10707
B3 166728    69021     58800     59336  19850   17158
F1 165971    69788     60063     59785  20446   17454
F2 164132    66491     56861     57099  18864   15741
F3 165077    65313     54217     55823  16844   13309
R1 186024    78065     65217     66600  17586   14642
R2 171404    70671     58896     59353  17300   14926
R3 183121    75811     64817     65480  20691   17332
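
For reference, the table above was assembled in the usual DADA2-tutorial fashion, roughly as follows (out, dadaFs, dadaRs, mergers, seqtab.nochim, and sample.names are the standard pipeline objects; the names here are assumptions, not code taken from this thread):

library(dada2)

getN <- function(x) sum(getUniques(x))  # total reads in a dada/merger object
track <- cbind(out,
               sapply(dadaFs, getN),
               sapply(dadaRs, getN),
               sapply(mergers, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
track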

Prior to filtering, I removed primers with cutadapt as follows:

cutadapt \
    -g CCTACGGGNGGCWGCAG \
    -G GACTACNVGGGTATCTAATCC \
    -o $forward_out \
    -p $reverse_out \
    $forward \
    $reverse > logs/$sample_name.cutadapt.txt

B1.cutadapt.txt (attached) is a representative output, and it appears that cutadapt successfully removed all forward/reverse primers.
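
(For completeness, a quick way to double-check this directly on the trimmed reads, adapted from the DADA2 ITS workflow, would be something like the sketch below; fnFs.cut and fnRs.cut are placeholder names for the primer-trimmed fastqs.)

library(ShortRead)
library(Biostrings)

FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACNVGGGTATCTAATCC"

# Count reads still containing a primer (fixed = FALSE allows the IUPAC codes)
primerHits <- function(primer, fn) {
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    sum(nhits > 0)
}
primerHits(FWD, fnFs.cut[[1]])  # should be ~0 after trimming
primerHits(REV, fnRs.cut[[1]])  # should be ~0 after trimming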

Next, for filtering, I used truncLen = c(228, 220) and maxEE = 2, with rm.phix = TRUE (all else left at default). The amplicon length without primers is ~426 nt, so this truncLen should have yielded ~22 nt of overlap. To my untrained eye, the reads appeared of adequate quality prior to filtering and quite good after filtering (see attached). I did lose a lot of reads here, but still had a sufficient amount to proceed (and would prefer to drop potential false positives).

===Pre-filtered reads===
forward.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf

===Post-filtered reads===
forward.filtered.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.filtered.read_quality.no_primers.trunc_228F_220R.maxEE_2.pdf
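
In code, the filtering step was roughly the following (cutFs/cutRs are the primer-trimmed inputs and filtFs/filtRs the output paths; these names are placeholders):

library(dada2)

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs,
                     truncLen = c(228, 220), maxEE = 2, rm.phix = TRUE)

# Quality profiles like the attached ones can be generated with, e.g.:
plotQualityProfile(cutFs[1:2])   # pre-filter
plotQualityProfile(filtFs[1:2])  # post-filter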

Learning error rates was done at default settings, and although the model fits seemed sufficient, error frequency often did not decrease with increasing quality score, which was unexpected.

forward.quality_vs_error_rates.no_primers.trunc_228F_220R.maxEE_2.pdf
reverse.quality_vs_error_rates.no_primers.trunc_228F_220R.maxEE_2.pdf
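
(The error-learning and diagnostic-plot calls were the standard ones, roughly:)

errF <- learnErrors(filtFs)
errR <- learnErrors(filtRs)

# Points are observed error frequencies per quality score, the black line is
# the fitted model, and the red line is the expectation from the nominal
# Q-score definition; the flat trend described above shows up in these plots.
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)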

I merged at default settings after denoising (also at default). The distribution of merged read lengths was as follows:

 258  260  300  328  383  386  387  400  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418  419  420  421  422  423  424  425  426  427  428  429  430  431 
   2    7   15    3    1    1    1    7   42  674  554  131   98   35  166   22   70    7   47    1   23   17   10   19   47   15   47  412   90  201  322 3523  200 2413 4522 1713   65    4    3 

Other than losing the majority of my reads, this distribution didn't seem unexpected. I also looked at the merge rejects and found almost no mismatches; all failed merges appeared to be due to a lack of overlap, and >95% of the failed merges had only 0-1 nt of overlap between the forward and reverse reads.
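
I inspected the rejects roughly along these lines (returnRejects = TRUE keeps the failed pairings in the per-sample data.frames; object names are the usual pipeline placeholders):

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs,
                      returnRejects = TRUE, verbose = TRUE)

rej <- mergers[[1]][!mergers[[1]]$accept, ]  # failed pairings for sample 1
table(rej$nmismatch)  # mismatches within the attempted overlap (almost all 0)
table(rej$nmatch)     # overlap length of failed merges (mostly 0-1 nt)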

Of note, I used vsearch to merge my filtered reads and obtained far better recovery (--fastq_maxdiffs 0 and --fastq_minovlen 12 were used to mimic DADA2's defaults of a 12 nt minimum overlap with 0 mismatches):

Sample Pairs Merged (%)
B1 67269 54838 (81.5%)
B2 68605 55726 (81.2%)
B3 69021 56953 (82.5%)
F1 69788 56092 (80.4%)
F2 66491 53495 (80.5%)
F3 65313 52022 (79.7%)
R1 78065 64500 (82.6%)
R2 70671 57245 (81.0%)
R3 75811 62277 (82.1%)

So I'm not quite sure what's going on, but I suspect the difference in merge success between vsearch and DADA2's mergePairs indicates that something is going wrong with denoising (and is thus causing issues when merging the denoised reads). Could the failure of error frequency to decrease with increasing quality be causing a problem with the learned error rates (or indicating the presence of one)? I've seen a user suggest (Issue #831) running filterAndTrim, merging with an external tool (e.g., vsearch), and then returning to DADA2 for chimera removal and onward, but I'm reluctant to throw out DADA2's denoising step.
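
Before abandoning mergePairs entirely, a couple of in-DADA2 diagnostics I could try; the parameter values below are purely illustrative, not recommendations:

# 1. Relax the overlap requirements slightly to see whether near-misses recover.
mergers_relaxed <- mergePairs(dadaFs, filtFs, dadaRs, filtRs,
                              minOverlap = 8, maxMismatch = 1, verbose = TRUE)

# 2. Concatenate instead of merging, purely to check how many denoised pairs
#    survive independently of the overlap requirement.
mergers_concat <- mergePairs(dadaFs, filtFs, dadaRs, filtRs,
                             justConcatenate = TRUE, verbose = TRUE)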

Thanks for any help that can be provided!

benjjneb commented 1 year ago

I agree that these diagnostics, namely the very flat error rates across quality scores and the relatively large (>10%) fraction of reads being lost at the denoising step, indicate something is not going right at that step. I'm not sure what it could be, though. One thing I would check myself is whether the primers in the raw reads are always at the start of the reads as expected, with no variation in starting position. This can be seen by just looking at a fastq file.

D-gallinson commented 1 year ago

I checked the fastq files, and it looks like there's an inconsistency with the primer sequences themselves. When the primer is intact, it's found at position 1 >99% of the time in every fastq file. However, ~17% of the forward primers and ~20% of the reverse primers are problematic (a deletion or a point mutation). For example:

CCTACGGGNGGCWGCAG <- forward primer
-CTACGGGTGGCTGCAG <- fastq seq (deletion pos 1)
ACTACGGGTGGCTGCAG <- fastq seq (C -> A transversion pos 1)
CCTA-GGGTGGCTGCAG <- fastq seq (deletion pos 5)

The cutadapt settings I used tolerate mismatches and indels, so these mutated primers were still removed; I'm uncertain whether this could be causing my problems. If it's worthwhile, I could try removing reads with primer mutations and re-running DADA2.
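
If I do try that, a rough sketch of the idea (shown for one sample's forward primer only; raw_F, raw_R, and the output filenames are placeholders, and the same logical index is applied to both files to keep pairs in sync):

library(ShortRead)
library(Biostrings)

fwd_primer <- DNAString("CCTACGGGNGGCWGCAG")

fqF <- readFastq(raw_F)  # raw (pre-cutadapt) forward reads
fqR <- readFastq(raw_R)  # matching raw reverse reads

# Compare the first 17 nt of each forward read to the primer, allowing the
# primer's IUPAC codes (N, W) but no mismatches or indels.
lead <- narrow(sread(fqF), start = 1, end = length(fwd_primer))
intact <- vcountPattern(fwd_primer, lead, fixed = FALSE) > 0

writeFastq(fqF[intact], "sample.primerok_R1.fastq.gz")
writeFastq(fqR[intact], "sample.primerok_R2.fastq.gz")

# (The reverse primer would be handled the same way, combining both filters.)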