benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

truncQ quality scores #1709

Closed ecologysarah closed 1 year ago

ecologysarah commented 1 year ago

Hi,

I am trying to understand exactly what truncQ does when trimming reads. According to the documentation, "Default 2. Truncate reads at the first instance of a quality score less than or equal to truncQ." What is this quality score? If it is a Phred score, 2 is a VERY low bar to set. If it is not Phred, what is it? And if lower scores correspond to lower quality, why do Rolling et al. (https://insight.jci.org/articles/view/151663) see more reads passing when truncQ is raised?

Any insight much appreciated, thanks!

benjjneb commented 1 year ago

truncQ is a very low bar because we recommend maxEE as the primary quality filter. This is in part because truncQ-style truncation introduces length variation into the data, which is undesirable when trying to maximize sensitivity to low-frequency variants in the subsequent denoising step.
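To make the "very low bar" and the length-variation point concrete, here is a minimal sketch in plain Python. It is not dada2 code. It assumes the scores are standard Phred scores (error probability 10^(-Q/10)) and applies the documented rule of truncating at the first score less than or equal to truncQ. The function names and the toy reads are hypothetical, made up for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def truncate_at_quality(quals, trunc_q=2):
    """Keep the scores before the first one <= trunc_q (illustrative helper, not dada2's code)."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return quals[:i]
    return quals


# Toy per-base quality scores for three reads of equal starting length (made-up values).
reads = {
    "read_a": [38, 38, 37, 36, 30, 25, 2, 2, 2],
    "read_b": [38, 37, 37, 35, 34, 33, 30, 28, 20],
    "read_c": [37, 36, 2, 30, 30, 30, 30, 30, 30],
}

# Length of each read after truncation with trunc_q=2.
truncated_lengths = {name: len(truncate_at_quality(q)) for name, q in reads.items()}

print(round(phred_to_error_prob(2), 2))
print(truncated_lengths)
```

Under these assumptions a score of 2 corresponds to an error probability of roughly 0.63, so only essentially uninformative bases trigger truncation. The three reads start at the same length but end at different lengths, which is the length variation described above.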

> why do Rolling et al. (https://insight.jci.org/articles/view/151663) see more reads passing when truncQ is raised?

Probably an interaction with maxEE filtering. Shorter reads (caused by truncQ truncation) have fewer expected errors, and thus pass at a higher rate. In normal situations, though, combining maxEE filtering with truncLen fixed-length truncation is the preferable approach, because it yields constant post-filter read lengths.
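The interaction can be illustrated with a small Python sketch. It is not dada2's implementation. It assumes Phred scores, computes expected errors as the sum of 10^(-Q/10) over the bases that are kept, and assumes the maxEE check runs after truncation. The helper names and the toy read are hypothetical.

```python
def expected_errors(quals):
    """Expected number of errors in a read: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)


def truncate_at_quality(quals, trunc_q):
    """Keep the scores before the first one <= trunc_q (illustrative helper)."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return quals[:i]
    return quals


def passes_filter(quals, trunc_q, max_ee):
    """Truncate first, then require a non-empty read with expected errors <= max_ee."""
    kept = truncate_at_quality(quals, trunc_q)
    return len(kept) > 0 and expected_errors(kept) <= max_ee


# Toy read: 150 high-quality bases followed by a 40-base tail at Q12 (made-up values).
read = [35] * 150 + [12] * 40

for trunc_q in (2, 12):
    kept = truncate_at_quality(read, trunc_q)
    print(trunc_q, len(kept), round(expected_errors(kept), 2), passes_filter(read, trunc_q, 2.0))
```

With truncQ=2 nothing is truncated and the Q12 tail pushes the expected errors above a maxEE of 2, so the read is discarded. With truncQ=12 the tail is cut off, the shorter read has far fewer expected errors, and it passes. The cost is that surviving reads no longer share a common length.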

benjjneb commented 1 year ago

Just checked the paper, and they are working with ITS data, which is too length-variable to apply truncLen-style fixed-length truncation. In that setting, using truncQ is more appropriate, although we still recommend doing most if not all of the quality filtering via maxEE even there.

ecologysarah commented 1 year ago

Very helpful, thank you @benjjneb