benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars, 142 forks

Few number of reads merged #982

Closed: ShailNair closed this issue 4 years ago

ShailNair commented 4 years ago

Hi Ben, I know this issue has been posted many times. I tried to follow older posts but could not resolve my issue. I am running DADA2 pipeline for my v3-v4 amplicon analysis with F primer: 343F (5'- TACGGRAGGCAGCAG -3) , R primer: 798R (5'- AGGGTATCTAATCCT-3'). I get through all steps smoothly until I merge the pairs. After merging I can see: abundance forward reverse nmatch nmismatch nindel prefer accept 289 3 4 66 143 0 0 1 TRUE 315 3 71 2 151 0 0 2 TRUE

Further, when I track the reads that make it through each step, I see that very few are merged:

   input filtered denoisedF denoisedR merged nonchim
A1 49314    41640     41114     41150      6      11
A2 59182    50052     49338     49487      0       0
B1 42725    36074     35686     35770      5       8
B2 37672    28731     28255      8324      4       4
C1 59203    49415     48762     48836     31      29
C2 81965    66657     65767     65973     44      53

Is this normal, or is something wrong with my filtering and trimming step? (I did not proceed with further steps.)

Here is my filter and trim code:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(210,190),
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, trimLeft=c(15,15))
head(out)

I tried increasing to truncLen=c(230,200), but it made little difference in the number of merged reads.

I have attached sequence quality plots for your reference.

Regards,

fwd.pdf rvrs.pdf

benjjneb commented 4 years ago

How long is your sequenced amplicon expected to be? If it is longer than the sum of your truncation lengths, then the reads don't overlap and can't merge. (You also need an extra ~12 nts for the overlap itself.)

Illumina V3V4 is ~460 nts, and since you are using 798R instead of 785R, your sequenced amplicon will be another ~13 nts longer. So a ~473 nt sequenced amplicon plus 12 nts of overlap means your truncation lengths need to add up to ~490 nts or more.
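This arithmetic can be sketched as a quick check in R. The ~473 nt amplicon estimate and 12 nt minimum overlap are taken from the comment above; the truncation lengths are the ones from the original filterAndTrim call:

```r
# Sketch: do a given pair of truncation lengths leave enough overlap to merge?
amplicon_len <- 460 + 13   # ~V3V4 length with 798R instead of 785R (per comment above)
min_overlap  <- 12         # extra bases needed for the reads to overlap

trunc_lens <- c(210, 190)  # values from the original filterAndTrim call

# Overlap between the truncated forward and reverse reads:
sum(trunc_lens) - amplicon_len          # -73: reads fall 73 nt short of each other

# The truncation lengths must sum to at least amplicon length + minimum overlap:
sum(trunc_lens) >= amplicon_len + min_overlap   # FALSE with c(210, 190)
sum(c(250, 240)) >= amplicon_len + min_overlap  # TRUE, i.e. sums to ~490 or more
```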

ShailNair commented 4 years ago

I thought the V3V4 amplicon size was ~440-450 nts. Is there any way to know exactly what the actual amplicon length is? Secondly, if I increase the truncation lengths so they add up to ~490 nts, I have to include extra bases that, according to the quality plots, have poor scores (below 30). Will that be a problem?

benjjneb commented 4 years ago

Is there any way to know exactly what the actual amplicon length is?

One way is to extract the amplified region from a collection of 16S sequences (e.g. RDP or the like) and inspect the length distribution. V3V4 has significant biological length variation, including two modes about 20 nts apart. For your amplicon these will show up as a mode at ~450 nts and another at ~470 nts. You don't want to throw either away.
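A rough sketch of this in R, using the Biostrings package: locate the forward primer and the reverse complement of the reverse primer in each reference sequence and tabulate the distance between them. The primer sequences come from the thread; the FASTA file name, the mismatch tolerance, and the use of vmatchPattern with fixed=FALSE (so the degenerate base R matches A or G) are illustrative assumptions, not something prescribed here.

```r
library(Biostrings)

# Hypothetical input: full-length 16S reference sequences (e.g. exported from RDP)
refs <- readDNAStringSet("rdp_16S_fulllength.fasta")

fwd <- DNAString("TACGGRAGGCAGCAG")                      # 343F, from the thread
rev <- reverseComplement(DNAString("AGGGTATCTAATCCT"))   # 798R, on the plus strand

# Find primer binding sites; fixed=FALSE allows IUPAC degenerate-base matching
fwd_hits <- vmatchPattern(fwd, refs, max.mismatch = 1, fixed = FALSE)
rev_hits <- vmatchPattern(rev, refs, max.mismatch = 1, fixed = FALSE)

# Per-sequence match positions as plain lists
f_starts <- startIndex(fwd_hits)
r_ends   <- endIndex(rev_hits)

# Amplicon length for sequences where each primer hits exactly once
amplen <- mapply(function(f, r) {
  if (length(f) == 1 && length(r) == 1) r - f + 1 else NA_integer_
}, f_starts, r_ends)

# Inspect the distribution; for V3V4 expect modes near ~450 and ~470 nts
hist(amplen, breaks = 50, main = "Amplicon length distribution")
```

This requires a local reference FASTA and Bioconductor's Biostrings, so it is a sketch rather than a turnkey script.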

Secondly, if I increase the truncation lengths so they add up to ~490 nts, I have to include extra bases that, according to the quality plots, have poor scores (below 30). Will that be a problem?

Lower-quality sequencing never helps, but this doesn't invalidate anything. Remember that dada2 uses a quality-aware error model, so it handles lower-quality sequencing pretty well. The main cost is that you will lose some sensitivity to rare variants, because there will be fewer error-free reads in the data.

ShailNair commented 4 years ago

Thanks. I increased the truncation lengths and it worked: now more than 80% of my reads merge.

ShailNair commented 4 years ago

I am quite a newbie to dada2. I am following the DADA2 1.12 tutorial. I would like to apply DADA2 pooling to my data. To do so, do I need to set pool=TRUE only when applying the core sample inference algorithm, or in all subsequent steps as well?

Secondly, is there any way to run the pipeline with pool=TRUE on already-analyzed data (perhaps resuming from somewhere in the middle), or do I need to rerun the whole pipeline?

benjjneb commented 4 years ago

@razz1618 Just in the dada(...) step, although if you use dada(..., pool=TRUE) we recommend that you also use pooling to remove chimeras, i.e. removeBimeraDenovo(..., method="pooled").

Secondly, is there any way to run the pipeline with pool=TRUE on already-analyzed data (perhaps resuming from somewhere in the middle), or do I need to rerun the whole pipeline?

You can start from the already filtered files quite easily.

PS: pool=TRUE can be quite computationally taxing for large studies (i.e. a whole MiSeq lane or bigger). In that case pool="pseudo" is a good option.
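A minimal sketch of resuming from the already-filtered files with pooling, following the DADA2 tutorial's variable names (filtFs/filtRs are assumed to still point at the filtered fastq files produced by the earlier filterAndTrim call):

```r
library(dada2)

# Re-learn error rates from the existing filtered files; no need to re-filter
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# pool=TRUE is only needed at the sample-inference (dada) step...
dadaFs <- dada(filtFs, err = errF, pool = TRUE, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, pool = TRUE, multithread = TRUE)

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
seqtab  <- makeSequenceTable(mergers)

# ...but, as recommended above, pair it with pooled chimera removal
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled",
                                    multithread = TRUE, verbose = TRUE)
```

This needs the filtered fastq files on disk and real data, so it is a sketch of the resume path rather than a runnable standalone script.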

ShailNair commented 4 years ago

Thanks. As my data isn't big, it ran smoothly with pool=TRUE.