Correction of sequencing versus PCR errors

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

http://benjjneb.github.io/dada2/

GNU Lesser General Public License v3.0

469 stars 142 forks source link

Correction of sequencing versus PCR errors #878

Closed BrianGrillPhD closed 4 years ago

BrianGrillPhD commented 4 years ago

Does anyone know if DADA2 corrects sequence errors arising from PCR or just those that occur during Illumina sequencing? Looking at the documentation, it's clear that the algorithm corrects for sequencing errors, but I'm not seeing specific mentions of PCR errors. If PCR errors are not corrected, are people taking DADA2 output and running it through another pipeline like ObiTools?

This question is prompted by the observation that there are more ASVs in my dataset than I would expect based on known marker variation and the diversity of the study area. The data are trnL P6 reads sequenced 2 x 75 bp on an Illumina MiSeq for diet analysis. The data have been processed using settings based on the ITS tutorial (primers removed in all orientations and adjusted filter and trim settings). Could extra ASVs be resulting from biological length variation?

benjjneb commented 4 years ago

DADA2 corrects both PCR and sequencing errors.

This question is prompted by the observation that there are more ASVs in my dataset than I would expect based on known marker variation and the diversity of the study area.

The three main ways this could happen would be (1) more biological sequence variation than realized at this locus, (2) library preparation artefacts that introduce spurious variation, or (3) the algorithm itself not performing well in correcting errors.

The data have been processed using settings based on the ITS tutorial (primers removed in all orientations and adjusted filter and trim settings).

Is the trnL locus highly variable in length? If not, it might help a bit to truncate to a fixed length as per the regular tutorial. Also, are the paired end sequences overlapping? If so, how much?

BrianGrillPhD commented 4 years ago

Thanks for your response Ben.

Below is the output from plotErrors. What do you think?

Screen Shot 2019-11-06 at 3 20 51 PM

The trnL locus is pretty length variable. The reference library we have built for plants in this area ranges from ~10-150 with a modal length of 50. See below:

Reference Library

With a required 12 bp overlap, primers of lengths 17 and 22, and a 2 x 75 bp kit, we are thus incapable of getting anything over ~100 bp. The solution we've come up with to probe that is to process the data as single end and compare missing ASVs.

benjjneb commented 4 years ago

Those error plots look pretty normal, nothing that would raise any alarms for me.

The reference library we have built for plants in this area ranges from ~10-150 with a modal length of 50.

That is very variable... One thing that might be worth trying is to run collapseNoMismatch on the final sequence table (and please use 1.14.0 or later as a bug was fixed in that function recently). That would collapse any variation that was just based on different sequence lengths.

The solution we've come up with to probe that is to process the data as single end and compare missing ASVs.

Can you clarify... are you using just the forward reads? Or merging somehow? What does "compare missing ASVs" mean?

BrianGrillPhD commented 4 years ago

Great to know about collapseNoMismatch! It does not seem to have a huge effect on this dataset though (just 10 of 665 ASVs collapsed) indicating that length variation is probably not driving up the number of ASVs.

Looking at the ASVs assigned to one species, it seems like the algorithm really is finding support for lots of slight variations. See below.

Screen Shot 2019-11-06 at 4 16 41 PM

Re: Comparing output of single v paired end approaches. Because the DADA2 merging step will discard things that can't be merged because they have no overlap, we were comparing the ASVs output when we processed just the forward reads (presumably retaining representation of even long ASVs without overlap needed to merging if processed as paired end) versus processing the data as paired end (and allowing the program to remove things that can't be merged).