benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Optimizing filterAndTrim settings for different datasets (ITS2) when quality scores differ #1871

Closed: ekronold closed this issue 5 months ago

ekronold commented 10 months ago

Hi @benjjneb

I am currently working on some ITS2 datasets that have been generated separately from different substrates, and therefore differ slightly in the library preparation. The goal is to run dada2 and then be able to merge and cluster the different substrates into one OTU table.

I am running cutadapt to trim primers and demultiplex before running dada2. For context, here are the quality profiles of the two libraries I am running:

Library 1: FWD.raw_qualityProfile.SS.pdf REV.raw_qualityProfile.SS.pdf

Library 2: FWD.raw_qualityProfile.SS.pdf REV.raw_qualityProfile.SS.pdf

While both have quite high-quality reads, the second library has much higher overall quality than the first.
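For reference, a minimal sketch of how such aggregate quality profiles can be generated with dada2 (the directory path and filename pattern here are placeholders):

```r
library(dada2)
# Placeholder paths: primer-trimmed reads from cutadapt
fnFs <- sort(list.files("lib1/trimmed", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("lib1/trimmed", pattern = "_R2.fastq.gz", full.names = TRUE))
# Aggregate quality profile over the first few samples
plotQualityProfile(fnFs[1:4], aggregate = TRUE)
plotQualityProfile(fnRs[1:4], aggregate = TRUE)
```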

Since it is not viable to use a fixed truncation length when ITS2 is so variable in length, I have explored other options for removing the potentially lower-quality tails to improve the paired-end merging. I ended up using this call for the first library (same settings for both SS and AS):

filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(2,2), trimRight=10, truncQ=2, minLen=50, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

and this for the second library:

filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(2,2), truncQ=2, minLen=50, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

So the only difference is that I am removing 10 nucleotides from the 3' end of each read in the library with lower quality scores.
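For anyone comparing the two settings, a sketch of running both filters side by side and checking per-sample read retention; filterAndTrim() returns a matrix of reads in/out, and the `.trim` output paths here are placeholders:

```r
# Without trimRight (library 2 setting)
out.notrim <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                            maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 50,
                            rm.phix = TRUE, compress = TRUE, multithread = TRUE)
# With trimRight = 10 (library 1 setting); filtFs.trim/filtRs.trim are placeholders
out.trim <- filterAndTrim(fnFs, filtFs.trim, fnRs, filtRs.trim,
                          maxN = 0, maxEE = c(2, 2), trimRight = 10, truncQ = 2,
                          minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
# Fraction of reads surviving each filter, per sample
head(out.notrim[, "reads.out"] / out.notrim[, "reads.in"])
head(out.trim[, "reads.out"] / out.trim[, "reads.in"])
```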

My question is this: we have previously just run without any trimming regardless of quality. When I set trimRight=10 I get quite a lot more ASVs out (a ~10% increase in the number of ASVs for library 1). This is only the case for the first library, though; when I add trimRight=10 to the second library there is barely any change in the number of ASVs recovered (less than a 0.001% change).

I did some further tests by saving the ASV sequences from the first library with the trimRight=10 setting and then without any trimming. When BLASTing the untrimmed sequences against the trimmed sequences, I find that 99% of the untrimmed sequences are also present with 100% identity in the trimmed data; the trimmed dataset just outputs several additional ASVs. We have had a discussion in my group about whether these additional ASVs are "real" diversity that dada2 recovers when the low-quality tails are trimmed away, or artefacts created by oversplitting. These new ASVs also resolve nicely when taxonomically assigned.
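Since both runs produce merged amplicons of the same length, the same comparison can also be done by exact sequence matching in R rather than BLAST; a sketch, assuming seqtab.trim and seqtab.notrim are the two makeSequenceTable() outputs:

```r
asvs.trim <- getSequences(seqtab.trim)
asvs.notrim <- getSequences(seqtab.notrim)
# ASVs recovered only when trimRight = 10 is applied
extra <- setdiff(asvs.trim, asvs.notrim)
length(extra) / length(asvs.trim)
# Are the trim-only ASVs all low-abundance?
summary(colSums(seqtab.trim[, extra, drop = FALSE]))
```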

We are having this discussion because we are hoping to use dada2 to generate these ASV tables separately for each substrate (there are more than just these two) and then merge and cluster them into one OTU table. While many of the "new" ASVs are in the lower quartile of read abundance, they might be important to keep, as they might be present at differing abundances in the other substrates.
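A sketch of the merging step we have in mind, assuming one sequence table per substrate from separate dada2 runs (the seqtab.sub* names are placeholders):

```r
# Merge the per-substrate sequence tables into one
merged <- mergeSequenceTables(seqtab.sub1, seqtab.sub2, seqtab.sub3)
# Chimera removal on the merged table keeps the call consistent across substrates
merged.nochim <- removeBimeraDenovo(merged, method = "consensus", multithread = TRUE)
```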

From this I am inclined to think that for each dataset we should explore different settings and find an appropriate level of trimming that gives an "optimal" ASV table for downstream analysis. Or am I overcomplicating things and making the outputs no longer comparable across datasets?

Note: both libraries contain ~180 samples and have been kept separate through both cutadapt and dada2

Thank you for any input you might have! Eivind

benjjneb commented 10 months ago

> We have had a discussion in my group about whether these additional ASVs are "real" diversity that dada2 recovers when the low-quality tails are trimmed away, or artefacts created by oversplitting. These new ASVs also resolve nicely when taxonomically assigned.

I can't say for sure, but I have seen this behavior, and probably what is going on is: the low-quality tails mean that DADA2 has a lower fraction of error-free reads to work with, and those error-free reads are what seed new ASVs. As a result, ASVs on the margins of detectability (low abundance, near other ASVs) are less sensitively detected when the low-quality tails are retained.

> From this I am inclined to think that for each dataset we should explore different settings and find an appropriate level of trimming that gives an "optimal" ASV table for downstream analysis. Or am I overcomplicating things and making the outputs no longer comparable across datasets?

I think that second part is what's most important here. When you trimRight only one of the datasets, you can end up with ASVs that aren't properly combined when merging the datasets later (because of length variation). This doesn't matter so much if paired-end reads are merged, as the merged amplicon will have the same size even after trimming off a bit of the ends of the constituent reads, but is very important when working with single-end data.
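One possible way to reconcile that kind of length variation after merging tables is dada2's collapseNoMismatch(), which combines sequences that are identical over their shared overlap; a minimal sketch, assuming `merged` is the output of mergeSequenceTables():

```r
# Collapse ASVs that are identical except for length (e.g., trailing bases
# left by trimming only one dataset); minOverlap = 20 is the default
merged.collapsed <- collapseNoMismatch(merged, minOverlap = 20)
```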

Overall, some fiddling with the bioinformatics steps is fine, but don't get too bogged down there, because a computationally perfect result is probably not attainable. And remember to keep the batch information in the data as you move on to visualization/statistics, and check whether that factor might be affecting your findings (e.g., fitting the statistical models with and without batch to see if the findings change).
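As one way to do that batch check (not the only one), a sketch using a PERMANOVA from the vegan package; `otu`, `meta`, and the `batch`/`substrate` columns are placeholders:

```r
library(vegan)
# Bray-Curtis distances on the merged OTU table
d <- vegdist(otu, method = "bray")
# Does library/batch explain variation beyond the factor of interest?
adonis2(d ~ batch + substrate, data = meta)
```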

ekronold commented 10 months ago

Thank you, this clarifies a lot of what we were discussing.