benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
470 stars 142 forks source link

Workaround options for read loss in filterAndTrim / premerging PE reads / maxEE flag #2031

Open ThomePauline opened 1 month ago

ThomePauline commented 1 month ago

Hi @benjjneb ,

after losing >60% of reads (43M total, Illumina PE 2x300 bp, 18S V4) through filterAndTrim(..., maxEE = 2, truncQ = 2, maxN = 0), we tried two workarounds to be able to keep more:

  1. setting maxEE = Inf, which is the flag causing the loss (as we systematically tested), which reduced the overall loss during DADA2 to 16%

  2. premerging before DADA2 and omitting the filterAndTrim() step altogether (being aware that you recommend against it), using USEARCH as pointed out by you previously (https://github.com/benjjneb/dada2/issues/327#issuecomment-400038014). Here we lose 30% overall (mainly during denoising)

We have two short questions about this:

  1. we would like to understand the maxEE flag better: if reads are truncated at the first base with a quality score of 2 or worse (truncQ), how can the overall expected error (maxEE) still be higher than this afterwards?

  2. judging from your experience, is option 1 or 2 the better workaround?

Thank you very much in advance!

benjjneb commented 1 week ago

Sorry for the slow response.

  1. we would like to understand the maxEE flag better: if reads are truncated at the first base with a quality score of 2 or worse (truncQ), how can the overall expected error (maxEE) still be higher than this afterwards?

You can read about expected errors in the paper introducing them as a filtering method here: https://doi.org/10.1093/bioinformatics/btv401

In your case, after the read is truncated (if it is) at the first position with Q=2, the expected errors are calculated by summing up the probability of an error based on the quality scores at all remaining bases in the read. If that number exceeds 2, it fails the maxEE=2 filter and is removed. We find this to be a better way to filter reads than the "average quality score" alternative.

  1. judging from your experience, is option 1 or 2 the better workaround?

Neither. See this previous comment for some explanation on why increasing read numbers by pushing through lower quality reads is not usually beneficial:

What you are trying to get out of 16S sequencing is a measurement of the membership in the microbial community and their relative abundances. The total number of reads that make it through the pipeline is only relevant in the sense that it can affect the accuracy of the measured relative abundances. All else equal, more reads is better because you reduce count noise in low abundance taxa and increase sensitivity to rare (~1 read) taxa. But here all else is not equal. By increasing maxEE dramatically you are keeping less accurate data. In practice, we usually see this leading to less accurate measurements of relative abundances at the end, because the low quality data that is being pushed through the pipeline just adds noise.

Shorter: The total read counts at the end don't matter, because everything is a relative abundance anyway. So your decisions on questions like this should be based on improving the accuracy of the relative abundance measurements, not on the total number of reads in the final table.