benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Can a variable start of amplicon sequence compromise learnErrors (or other subsequent functions)? #923

Closed RemiMaglione closed 4 years ago

RemiMaglione commented 4 years ago

Hello, I have variable-length barcodes. In my case I trimmed all my sequences by the maximum barcode length (that of my longest barcode), so reads with shorter barcodes lose a few true biological bases at the beginning. An alignment of the trimmed sequences looks like this:

AAAGTTATCGGC (for longest barcoded sequence)
--AGTTATCGGC
------ATCGGC (for the shortest barcoded sequence)

Since version 1.3.3, DADA2 has allowed variable-length amplicons in the dada aligner in order to deal with ITS amplicons (see NEWS), and collapseNoMismatch can merge sequences of variable length. But what about learnErrors? Does it take the variable length/start into account, for example by aligning the subsampled sequences before learning the errors? In my example above, that would start the error learning at position 7 for the shortest barcoded sequence instead of at position 1.

To make it short: can a variable amplicon start position compromise learnErrors (or other subsequent functions)? Thanks in advance, Rémi

benjjneb commented 4 years ago

We recommend that you trim the reads to a common starting position based on the primer sequence using an external program prior to running DADA2 in this case.

The variable start position doesn't invalidate DADA2, and the sequences you posted above will be collapsed together appropriately, but it does lower the sensitivity to rarer variants, which in turn causes error rates to be somewhat overestimated.
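
For illustration only, here is a minimal Python sketch of what "trim to a common starting position based on the primer" means. It is not part of DADA2 and is not a substitute for a dedicated external trimming program (cutadapt is one commonly used option, though the thread does not name a specific tool). The primer sequence, the reads, and the function name `trim_to_primer` are all hypothetical, and the sketch assumes exact primer matches with no mismatches or ambiguity codes.

```python
def trim_to_primer(read, primer, max_offset=10):
    """Return the part of `read` that follows the primer.

    The primer must match exactly and start within the first
    `max_offset` bases; otherwise None is returned so the caller
    can discard the read.
    """
    # str.find only reports matches that lie fully inside [0, end),
    # so this allows the primer to start at index 0..max_offset.
    idx = read.find(primer, 0, max_offset + len(primer))
    if idx == -1:
        return None
    return read[idx + len(primer):]


# Hypothetical primer and reads: each read carries a different number of
# leftover barcode bases in front of the same primer and amplicon.
PRIMER = "GTGCCAGC"
reads = [
    "AA" + PRIMER + "TTACGG",
    PRIMER + "TTACGG",
    "CAGT" + PRIMER + "TTACGG",
]

trimmed = [trim_to_primer(r, PRIMER) for r in reads]
print(trimmed)
```

After this step every retained read starts at the same biological position, so position 1 means the same thing for all reads when they are passed on to learnErrors and dada.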