benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
468 stars 142 forks source link

A 20bps GAP in merged reads' length distribution. #1848

Closed gabrieltedone closed 4 months ago

gabrieltedone commented 10 months ago

First of all, thank you for implementing such a great tool and responding to all these issues!

While running the pipeline on 48 soil samples of 16S V3-V4 amplicons, I became slightly worried about the sequence length distribution of my merged reads (after removing chimeras with removeBimeraDenovo). Firstly, the merged sequences length distribution is wide, ranging from 269 to 480 nts in length. Also, it appears that most merged sequences have a length of either ~402bps or ~425bps.

> table(nchar(getSequences(seqtab.nochim)))

269  272  273  274  276  277  278  279  281  282  283  285  287  288  290  291  293  296  302  306  308  312  313  314  329  330  335  340  341 
   3    1    3    3    1    2    1    1    1    6    3    1    2    1    5    1    1    2    1    1    1    1    2    2    1    1    1    1    1 
 342  343  344  345  347  351  353  356  359  361  362  364  367  368  372  376  379  381  383  389  390  392  394  396  398  399  400  401  402 
   3    2    4    1    3    2    1    1    3   77    1    1    9    1    3    4    1    1    1    1    1    2    3    1    1    1    4   32 1589 
 403  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418  419  420  421  422  423  424  425  426  427  428  429  430  431 
 558  231  222   67  624   48   36   29   44    6   33    2    8   49   11   21   78   35  162 1738   38   52   56  622 1872  386   11    5    1 
 433  434  435  439  440  441  442  443  444  445  446  447  449  450  459  460  465  473  475  477  478  479  480 
   1    1    1    4    3    1    2    1    4    6    3    4    3    7    1    3    1    1    1    1    1    3    1 

Plotting this information with: z <- table(nchar(getSequences(seqtab.nochim))) w <- as.data.frame(z, col.names = c('sequence length','frequency')) plot(w)

I get the following graph: image

(sorry about the ugly graph)

Is this expected?

Extra Info:

Primers used: Pro341F (5′-CCTACGGGNBGCASCAG-3′) Pro805R (5′-GACTACNVGGGTATCTAATCC-3′)

Sequencing: Illumina MiSeq 2x300 bps

benjjneb commented 10 months ago

Is this expected?

Yes, the V3V4 region in bacteria has a bimodal length distribution that differs by about 20 nts.