benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

truncLen parameter optimization for poor quality reads in filterAndTrim #274

Closed · NinaKons closed this issue 7 years ago

NinaKons commented 7 years ago

Dear ALL,

I was wondering what truncLen settings you would suggest in filterAndTrim for my reads; their quality profiles are attached as PDFs.

I used primers 341F and 805R to amplify the V3-V4 region of the 16S rRNA gene. The amplicon is ~450 bp and was sequenced with 2x300 chemistry.

Many thanks in advance. Best regards. Nina

fnFs_quality.pdf fnRs_quality.pdf

benjjneb commented 7 years ago

Those are decent quality reads really, at least for the 2x300 chemistry.

If your amplicon is ~450 nts, you will want over 450+20+biological_length_variation nts of total sequence after truncation. The last 80 nts of your reverse reads look bad. I think I would try truncLen=c(280,210) as my starting point. That should give you enough overlap between the reads (490 nts total).
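For concreteness, a minimal filterAndTrim call with that starting point might look like the sketch below (fnFs/fnRs and the filtered output paths are placeholders, and maxEE and the other filtering parameters are just common defaults to tune against your own data):

```r
library(dada2)

# Placeholder paths: fnFs/fnRs hold the forward/reverse fastq files
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(280, 210),   # starting point suggested above
                     maxEE = c(2, 2), truncQ = 2, maxN = 0,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)  # reads.in vs reads.out per sample
```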

pdhrati02 commented 3 years ago

Hi Benjamin, I am trying to use DADA2 (version 1.18) on my 16S data (V3-V4 region, 2x300 bp chemistry). I have read many of the issues on the GitHub page, but am still a little confused.

You mention in many posts that truncLen[1] + truncLen[2] should be >= amplicon length + 20 (in this case about 460 + 20). But you also mention in one post that for V3-V4 one should remove the primers with trimLeft=c(17,21). That would mean the reads are now about 280 bp long rather than 300, and hence the truncLen values would shift back by ~20 bp. For example, if based on the forward and reverse quality plots the truncLen was 270,210, it would now be 250,190 -- is that right? In that case, how can one reach the 470-480 bp merged read length at all?

For reference, please find attached the quality plots for my forward and reverse reads. Looking at these, I originally thought of using truncLen of 275,210. But these reads still carry the primers, meaning I would have to subtract a further ~20 from the truncLen, giving trimLeft=c(20,20) and truncLen=c(255,190). rev_qual.pdf for_qual.pdf How should one deal with such situations?

I have a separate question (sorry for mixing topics): just to give it a try with the same dataset, I went ahead with trimLeft=25 and truncLen=c(270,205), maxEE=c(3,4) (I tried the default maxEE as well). This gave me 30-60% total read loss (after chimera removal). In this case the sequence length distribution peaked at 389-390 bp, with a smaller rise again at 410 bp. Is this completely wrong? Would it mean there are low-quality bases in these reads because of the relatively relaxed truncLen, and how would one confirm that?

Any help would be appreciated. Thank you DP

benjjneb commented 3 years ago

@pdhrati02 When calculating truncation lengths, one should use the sequenced amplicon length. So, if the primers were sequenced, they are part of that calculation. For standard V3V4, the sequenced amplicon (including the sequenced primers) is ~460 nts. So there is no need to add the primer lengths in again; they are already accounted for.
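As a rough sanity check on that arithmetic, the expected overlap can be worked out directly (a sketch; 460 nts is the approximate sequenced V3V4 amplicon including primers, and 12 nts is the default minOverlap in mergePairs):

```r
# Expected read overlap for a given pair of truncation lengths
amplicon_len <- 460                 # sequenced amplicon length, primers included
truncLen_F   <- 270
truncLen_R   <- 205
overlap <- truncLen_F + truncLen_R - amplicon_len
overlap                             # 15 nts: above the mergePairs() default minOverlap of 12,
                                    # but aiming for amplicon_len + 20 leaves headroom for the
                                    # natural length variation of V3V4
```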

> This gave me 30-60% total read loss (after chimera removal).

It is more helpful to see the sequence loss throughout the whole workflow, as shown here: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

That is much more useful, because where reads are being lost is diagnostic of what is going on.
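For reference, the tracking table from that tutorial is built roughly as below (object names follow the tutorial: out from filterAndTrim, dadaFs/dadaRs from dada, mergers from mergePairs, seqtab.nochim from removeBimeraDenovo):

```r
# Track reads through the pipeline (adapted from the dada2 tutorial)
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```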

> In this case the sequence length distribution peaked at 389-390 bp, with a smaller rise again at 410 bp.

V3V4 has natural bimodal length variation, with two modes separated by about 20 nts, so this is expected.
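A quick way to see that bimodal distribution (assuming seqtab is the merged sequence table from makeSequenceTable):

```r
# Distribution of merged sequence lengths; V3V4 typically shows two modes ~20 nts apart
table(nchar(getSequences(seqtab)))
plot(table(nchar(getSequences(seqtab))), xlab = "Merged length (nt)", ylab = "ASVs")
```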

pdhrati02 commented 3 years ago

@benjjneb Thank you, so basically a sequence length distribution centered at ~390 bp is good? Or would you expect a different length distribution for the V3-V4 region?

Also, I have a few follow-up questions:

1. Using the parameters mentioned earlier (trimLeft=25, truncLen=c(270,205), maxEE=c(3,4)), the read loss through the pipeline looks okay for some samples (50-60% loss overall -- do you think that is acceptable?), while, as shown below, there is a huge loss of ~95% for others. (I can share my tracked-reads file if you wish to have a look.) How can I tackle this problem? These samples were all processed together and are part of the same run.

| Sample | input | filtered | denoisedF | denoisedR | merged | nonchim | %nonchim/raw |
|--------|--------|----------|-----------|-----------|--------|---------|--------------|
| 01T6   | 160195 | 125043   | 124146    | 124518    | 102547 | 93691   | 58.4856      |
| 02T12  | 163797 | 127107   | 126373    | 126719    | 112363 | 104213  | 63.62327     |
| 02T6   | 159185 | 121434   | 120942    | 121204    | 98125  | 96726   | 60.76326     |
| 03T6   | 169495 | 131338   | 130855    | 131107    | 102701 | 98549   | 58.14272     |

However in the same run I have a few samples which have a huge read loss:

| Sample | input  | filtered | denoisedF | denoisedR | merged | nonchim | %nonchim/raw |
|--------|--------|----------|-----------|-----------|--------|---------|--------------|
| 28T12  | 111010 | 78673    | 78393     | 78556     | 5768   | 5768    | 5.195928     |
| 28T6   | 153568 | 108697   | 108346    | 108517    | 9556   | 9529    | 6.205069     |

2. I ran DADA2 as above and got a very high number of ASVs: with minFoldParentOverAbundance=8, chimera removal identified 13121 bimeras out of 18704 input sequences, leaving ~5K ASVs. When I tried QIIME 1 on the same data I got roughly 1200 OTUs (these are human gut microbiome samples -- adult stool). I understand there is a difference between ASVs and OTUs, but would one expect such an extreme difference? I just want to confirm I am not going in the wrong direction here.

3. Also, of the ~5K ASVs, about 1000 have NA at the genus level. I used Silva version 138.1. How should one deal with such NAs at the genus level?

Thank you very much

Best regards DP

benjjneb commented 3 years ago
1. That read tracking looks reasonable: most reads are being lost at the filtering step, which is fine. There is a larger drop-off at the merging step than is ideal. I might consider slightly increasing the truncation lengths so they add up to 480 (sequenced amplicon length + 20 nts) and see if that helps at all.

2. Differences in the total number of ASVs/OTUs between methods are not something to get particularly concerned about. Those totals are very sensitive to detailed thresholds and the like that affect the very rare taxonomic fraction.

3. Depends on what you are trying to do. You could look at a higher taxonomic level, where there will be fewer NAs (one quick way to do that is sketched below). But it is not always possible to determine the genus from short-read 16S.
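One way to quantify those NAs and fall back to a higher rank is sketched here (assuming taxa is the ranks matrix returned by assignTaxonomy with a Silva 138.1 reference; the "Unclassified_" labeling is just one convention):

```r
# taxa: ranks matrix returned by assignTaxonomy() with a Silva 138.1 reference
colSums(is.na(taxa))                       # how many ASVs are unclassified at each rank

# Fall back to the family name where the genus call is NA
genus_label <- ifelse(is.na(taxa[, "Genus"]),
                      paste0("Unclassified_", taxa[, "Family"]),
                      taxa[, "Genus"])
head(sort(table(genus_label), decreasing = TRUE))
```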

pdhrati02 commented 3 years ago

@benjjneb Thank you very much for your response. However, I would like to confirm whether you think it is possible to increase truncLen in this case. I am attaching my forward and reverse quality plots once again for reference. The 270 and 205 I was considering were based on Q25. Do you think that with these plots I can still increase truncLen? rev_qual.pdf for_qual.pdf

Thank you for all your help Best DP

pdhrati02 commented 3 years ago

Hi Benjamin, I tried increasing the truncLen to 275,210 / 280,200 / 275,205, but in all cases the number of filtered reads was lower than with 270,205. I think I will have to stick with those values. What do you reckon?
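For what it's worth, that kind of comparison can be scripted on a subset of samples, e.g. the sketch below (fnFs/filtFs/fnRs/filtRs are placeholder path vectors, and only the retained-read fraction is compared):

```r
# Compare filtered-read yield for several candidate truncLen pairs on a few samples
candidates <- list(c(270, 205), c(275, 210), c(280, 200), c(275, 205))

yields <- sapply(candidates, function(tl) {
  out <- filterAndTrim(fnFs[1:4], filtFs[1:4], fnRs[1:4], filtRs[1:4],
                       truncLen = tl, trimLeft = 25,
                       maxEE = c(3, 4), multithread = TRUE)
  sum(out[, "reads.out"]) / sum(out[, "reads.in"])   # fraction of reads kept
})
names(yields) <- sapply(candidates, paste, collapse = ",")
round(yields, 3)
```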

Sorry, but one last question: when merging two different runs, does maxEE need to be the same when processing both runs?

Thank you for your help once again.

Best DP

benjjneb commented 3 years ago

Seems like a good data-driven choice of parameter values to me.

pdhrati02 commented 3 years ago

@benjjneb Thank you very much. However, what about the question of merging runs processed with different maxEE? Is that okay?

Best DP

benjjneb commented 3 years ago

Yes that is OK.

pdhrati02 commented 3 years ago

Thank you very much for your prompt replies.

Thank you Best DP