Large variability in sequence lengths after merging paired reads

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

GNU Lesser General Public License v3.0


Hi,

I'm relatively new to bioinformatics and microbiology, and I've been following the DADA2 tutorial to process my 16S gDNA & eDNA sequencing data. After merging my paired reads and constructing the sequence table, I visualized the sequence lengths and noticed a considerable amount of variability. I'm unsure whether this variability is due to biological reasons or if it might be caused by technical issues, such as incomplete merging or sequencing errors.

Up until this point, everything else in the pipeline has looked good. I'm curious whether this variability in sequence lengths is a common observation at this stage when working with the 16S marker. If anyone could offer some advice, I would greatly appreciate it :)

Here's some additional information about my data:
Illumina MiSeq, 2x300 paired-end sequencing
V3-V4 target region
Primer set: FWD: CCTACGGGNGGCWGCAG, REV: GACTACHVGGGTATCTAATCC
Primers have been successfully removed

Thanks in advance!

[Figure: distribution of total reads by sequence length]

> Up until this point, everything else in the pipeline has looked good. I'm curious if this variability in sequence lengths is a common observation at this stage when working with the 16S marker.

Yes. First off, the two modes (peaks) of your sequence length distribution are expected. The V3-V4 region of the 16S rRNA gene has a naturally bimodal length distribution, with the two modes differing by about 20 nt.

The various other lengths you observe are not uncommon, and typically come from a mix of off-target amplification and library artefacts. It is completely valid to "cut a band in silico" and remove the ASVs outside the expected length distribution (this is described in the DADA2 tutorial, "Construct sequence table" section: https://benjjneb.github.io/dada2/tutorial.html).
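As a rough illustration of cutting a band in silico, here is a minimal sketch in base R. It uses a small made-up matrix in place of a real sequence table (samples in rows, ASV sequences as column names, the layout `makeSequenceTable()` produces), and the `400:430` window is only an assumed example; the actual range should be read off your own length histogram so that it brackets the two expected modes.

```r
# Toy stand-in for a DADA2 sequence table: rows are samples, columns are ASVs,
# and the column names are the ASV sequences themselves.
# (Hypothetical data; a real table would come from makeSequenceTable().)
make_seq <- function(len) paste(rep("A", len), collapse = "")
asv_lengths <- c(280, 402, 403, 422, 427, 445)
seqtab <- matrix(
  c(5, 120, 80, 60, 200, 3,
    0,  90, 40, 75, 150, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sampleA", "sampleB"),
                  vapply(asv_lengths, make_seq, character(1)))
)

# Number of ASVs at each sequence length.
print(table(nchar(colnames(seqtab))))

# Keep only ASVs whose length falls inside the chosen window.
# 400:430 is an assumption for illustration, not a recommended value.
keep_lengths <- 400:430
seqtab_filt <- seqtab[, nchar(colnames(seqtab)) %in% keep_lengths, drop = FALSE]

# Fraction of all reads retained after the length filter.
reads_kept <- sum(seqtab_filt) / sum(seqtab)
print(reads_kept)
```

Checking the fraction of reads retained is a quick sanity test: if the off-length ASVs are mostly off-target amplification and library artefacts, they should account for only a small share of the total reads.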


Large variability in sequence lengths after merging paired reads #2010