benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Long repeating regions of Cs and Gs #1646

Closed connor-morozumi closed 1 year ago

connor-morozumi commented 1 year ago

I imagine I am doing something wrong, such as not removing adapters, but I am hoping to get some guidance on sequences like this one that come out of dada():

"CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGTGTCAGCAGCCGCGGTACTCGAGCCCCTCCACACCCTAACCCTAATTCCCACCACAAATCTCTTACGCTGCCAACCCCAGCGCCGCCTAAGGTTCCTCTGAAGCGCCTCACGTCGGAGGAGATGGCGGTACGTAAAGACCAAGTCTCTGCTACCGTATACTGCAGCGATCTCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"

What are some reasons Phred scores would be high in these tails? Here's an example:

@A00419:388:H3GLFDRXY:1:2251:15194:15906 1:N:0:1
CCCATTAGATACCCCCGTAGTCCAGACCAAGTCTCTGCTACCGTACGTAATGAGCATCTCGTTTGCCGTGTTCTGCTTGTAGAGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
,F,F:FFFFFFFFFF:FFFFFFF,F,F:F,,FF,FFFFFFFFFF,FFF:FFFFFFFFF::FF,FFF:FF,F::FF:FFF,,,:,,,,F:FFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFF:FF,FFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFF

These are NovaSeq reads with Fluidigm (16S, Archaea, ITS1, and ITS2 primer sets), and the samples are soybean leaves.

Any assistance on what I am doing wrong (or even some keywords to google!) would be great, thanks!

benjjneb commented 1 year ago

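In Illumina two-color chemistries (NextSeq, NovaSeq, MiniSeq, ...), G is the absence of signal. When the signal drops out, for example once the read runs past the end of the fragment, the instrument confidently calls G at every remaining cycle. This leads to the high-quality polyG tail issue. See here for a brief discussion: https://www.biostars.org/p/294612/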
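To make the polyG tail idea concrete, here is a minimal Python sketch that trims a trailing run of Gs from a single read. The function name `trim_polyg_tail` and the `min_len` threshold are assumptions made for illustration. This is not how dada2 or fastp implement it, and unlike dedicated tools it tolerates no mismatches inside the tail.

```python
def trim_polyg_tail(seq, qual, min_len=10):
    """Trim a trailing run of G from a read if the run is at least min_len long.

    seq and qual are the sequence and quality strings of one FASTQ record.
    min_len is a hypothetical threshold chosen for illustration only.
    """
    stripped = seq.rstrip("G")            # drop every G at the 3' end
    run_len = len(seq) - len(stripped)    # length of the trailing G run
    if run_len >= min_len:
        # Keep the quality string the same length as the trimmed sequence.
        return stripped, qual[:len(stripped)]
    # Short G runs may be real sequence, so leave the read untouched.
    return seq, qual


# Example: a read that ends in a long polyG tail with high quality scores.
read = "CCCATTAGATACC" + "G" * 40
quals = "F" * len(read)
trimmed_seq, trimmed_qual = trim_polyg_tail(read, quals)
print(trimmed_seq)
```

Note that the quality scores play no role here: as in the example read above, the tail has high Phred scores, so quality-based trimming alone will not remove it.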

We haven't built a dedicated polyG tail solution, but the rm.lowcomplex option in filterAndTrim can help quite a bit. It may also be worth exploring pre-processing tools that specifically address this error type, such as fastp, which could be run prior to entering the dada2 workflow.
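As a rough illustration of why a low-complexity filter catches these reads, the Python sketch below scores a sequence by its effective number of k-mers (the exponential of the Shannon entropy of its k-mer frequencies). This follows the spirit of the k-mer complexity that rm.lowcomplex thresholds on, but it is an independent sketch and not dada2's code: the function name `kmer_complexity` and the default `k=2` are assumptions, and dada2's exact computation may differ.

```python
import math
from collections import Counter


def kmer_complexity(seq, k=2):
    """Effective number of k-mers: exp(Shannon entropy of k-mer frequencies).

    A hypothetical helper for illustration, not part of dada2.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k
    counts = Counter(kmers)
    total = len(kmers)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# A read dominated by a polyG tail scores much lower than a varied read.
varied = "ACGTTGCA" * 5
polyg = "ACGT" + "G" * 36
print(kmer_complexity(varied) > kmer_complexity(polyg))
```

A pure homopolymer scores 1, the minimum, so reads that are mostly polyG sit near the bottom of the range. Any cutoff would need to be chosen by looking at the score distribution of your own reads.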

connor-morozumi commented 1 year ago

Thanks, I will try those out. Cheers