benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Disconnect between read quality profiles and read loss through the Dada 2 pipeline #1721

Closed by pig-raffles 4 months ago

pig-raffles commented 1 year ago

Hi,

I previously posted about high numbers of reads being lost in certain samples during the chimera removal step of DADA2. It was suggested to try primer removal, and this worked to a degree, but there are now still high levels of read loss at the filter/trim and read merging stages, such that per-sample read loss at the end of the pipeline remains mostly unchanged relative to read loss before primer removal.

I have read posts on this forum about which filter/trim parameters to alter to improve read retention. I chose my truncLen parameters based on the plotQualityProfile() visualisations, picking truncLen values greater than the amplicon size + 25 while still trimming the sections of the reads where the mean quality drops off (< 28). (I sequenced 16S V4, with an amplicon size of ~300 bp.) I also increased the maxEE parameter to c(3,4).

I tried using @RemiMaglione's Qual_vs_MaxEE_plot.R to visualize the most suitable maxEE values to use. I ran this function on a range of samples (good and bad) and did not see much difference in which maxEE values to use across samples (most seemed to call for values below c(2,2)). I have attached two of these plots: one for a sample with ca. 89% read retention ("good") and one for a sample with ca. 6% read retention ("bad") through the DADA2 pipeline.

Finally, I tried different values of the maxMismatch parameter (1, 2) at the read merging step to see if this would improve read retention, but this did not have much of an effect. Varying minOverlap (12, 20) had no effect either.
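For reference, the merge step I varied looked roughly like this (a sketch; the dadaFs/filtFs/dadaRs/filtRs object names are placeholders for my denoised and filtered read objects):

```r
library(dada2)

# Merge denoised forward and reverse reads, varying the overlap criteria;
# minOverlap = 12 and maxMismatch = 0 are the dada2 defaults
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs,
                      minOverlap = 12,   # also tried 20
                      maxMismatch = 1,   # also tried 2
                      verbose = TRUE)
```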

If I am understanding the read quality profiles and maxEE plots correctly, there seems to be a disconnect between the trim and merge parameters suggested by these plots and the levels of read loss for certain samples over the course of the DADA2 pipeline. While my bad samples don't have great quality profiles, when I choose apparently suitable truncLen and maxEE values from eyeballing the plots, I still get massive read loss for these samples (80-94%). I was therefore wondering if anyone could explain what is going on here and/or offer advice on what other parameters I could tune to improve things.

[Attached maxEE plots: Good_42SWPost_MaxEE ("good" sample) and Bad_08FWANT_MaxEE ("bad" sample)]

Thanks in advance,

Alan

benjjneb commented 1 year ago

In your previous post https://github.com/benjjneb/dada2/issues/1715#issuecomment-1464743238 you indicated that you were using the EMP V4 primers, which do not sequence the primers at the start of the reads. What I didn't catch at that time is that you are using 2x300 sequencing. This is causing issues, because the gene region amplified by those primers is only 251-256 bp long. Thus, your 300 nt reads are reading into the opposite primer and adapter sequences.

This is easily solved though. Make sure that truncLen is set to a value <= 250, to avoid that read-through. I also recommend removing cutadapt from your workflow, as this truncation step will take care of the primers towards the ends of the reads that you were seeing in https://github.com/benjjneb/dada2/issues/1715#issuecomment-1465088892
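Concretely, that would look something like the following (a sketch; the exact truncation lengths and non-truncation parameters here are illustrative, not prescribed values):

```r
library(dada2)

# Truncate both forward and reverse reads below 250 nt so the 2x300 reads
# don't run through into the opposite primer/adapter; with truncation doing
# this job, a separate cutadapt step is no longer needed
filter_out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                            truncLen = c(240, 200),  # example values <= 250
                            maxN = 0, rm.phix = TRUE,
                            compress = TRUE, multithread = TRUE)
```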

pig-raffles commented 1 year ago

Would it be worth truncating the reads to say 250 bp and then running Figaro on the truncated reads to find an optimized set of filter/trim parameters?

benjjneb commented 1 year ago

Not really. Just make sure truncation lengths are below 250 for both forward and reverse reads, and my suspicion is most of the issues you've been encountering will go away.

pig-raffles commented 1 year ago

Thanks again for all the advice.

I tried the following filter/trim parameters:

filter_out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(180,160), maxN=0, maxEE=c(1,1), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

Following this, I used the R commands for primer identification from the DADA2 ITS workflow tutorial and found only 1 remaining primer sequence in the data set. Varying the truncLen parameters to remove this did not help.
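The primer check follows the ITS workflow tutorial's approach, roughly (sketch; `primer` would be the EMP V4 primer sequence and `fn` a filtered fastq file):

```r
library(Biostrings)
library(ShortRead)

# All orientations of a primer sequence (forward, complement,
# reverse, reverse-complement), as in the DADA2 ITS tutorial
allOrients <- function(primer) {
  dna <- DNAString(primer)
  orients <- c(Forward = dna,
               Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString))
}

# Count the reads in a fastq file that contain the primer;
# fixed = FALSE allows IUPAC ambiguity codes in the primer to match
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
```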

Read retention after the filter/trim and read merging steps was higher than previously. Unfortunately, I am back to the same problem in that I lose a large number of reads at the chimera removal step. So, it has come full circle and I am unsure how to proceed again.

benjjneb commented 1 year ago

What are the stats for reads passing through the chimera removal step? E.g. the tracking stats in this section of the dada2 tutorial: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline
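That is, something along the lines of the tutorial's tracking table (object names as in the tutorial; `out` is the filterAndTrim return value):

```r
# Count reads surviving each stage of the pipeline, per sample
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR",
                     "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```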

Losing up to ~25% of reads as chimeras is not out of the realm of normality, especially if using large PCR cycle numbers.

pig-raffles commented 1 year ago

The problem is that I have very variable rates of read loss among samples. Some samples lose only 10-20%, others lose 94%; the average read loss was 36%.

These levels don't seem to be reflected in the read quality profiles or the maxEE plots, by which I mean that the samples that lose lots of reads don't appear to be of especially poor quality in the two plot types mentioned (if I am interpreting them correctly). I have attached an image of the tracking stats, which gives a good overview of the level of read loss across different samples.

[Attached image: readtracking_dada2, read-tracking stats across samples]

benjjneb commented 1 year ago

That is hard to nail down. There seems to be a qualitative difference between the samples that are working as expected and those that aren't, which typically lose many reads at both the denoising and merging steps.

At this point a more open-ended investigation into what is going on in those samples vs. the "normal" samples would be appropriate. What are the top ASVs from those samples -- are they unlike those found in "normal" samples? Taking a subsample of reads from very poorly performing samples (e.g. 50 reads) and BLAST-ing them against a broad database like nr/nt -- what are they? Alternatively, using a cruder but broader tool like Kraken to classify all reads in these samples might reveal unexpected components in poorly performing samples, like off-target amplification of DNA from unexpected taxa.
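One way to pull such a subsample for BLAST would be something like this (a sketch using ShortRead; the input filename is a placeholder for one of the poorly performing samples):

```r
library(ShortRead)

# Draw 50 random reads from a poorly performing sample and write them
# out as FASTA, ready for a web BLAST search against nr/nt
set.seed(100)
fq  <- readFastq("08FWANT_R1.fastq.gz")       # placeholder filename
sub <- fq[sample(length(fq), 50)]
writeFasta(sub, "08FWANT_subsample.fasta")
```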

pig-raffles commented 1 year ago

Thank you for the detailed response.

Would it be best to use the discarded reads for this or reads still present at the end of the Dada2 pipeline? Is it possible to access the discarded reads for each sample?

benjjneb commented 1 year ago

Would it be best to use the discarded reads for this or reads still present at the end of the Dada2 pipeline?

If "this" means the analysis I proposed above, ASVs refer to the output of DADA2. Kraken would be used on the raw reads.
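For the ASV side, that could look roughly like this (a sketch; `seqtab.nochim` as in the tutorial, and the sample name is a placeholder for one of the poor performers):

```r
# Pull the 5 most abundant ASVs from one poorly performing sample;
# the names of the sequence-table columns are the ASV sequences
# themselves, so these strings can be BLASTed directly
sample_counts <- seqtab.nochim["08FWANT", ]
top_asvs <- names(sort(sample_counts, decreasing = TRUE))[1:5]
top_asvs
```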

pig-raffles commented 1 year ago

Exactly what I needed to know. Thanks again