benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Taxon specific complete loss of reads using ITS pipeline #1025

Closed kek12e closed 4 years ago

kek12e commented 4 years ago

I believe this to be a sequencing artifact rather than an issue with DADA2, but perhaps you have advice on how best to mitigate this issue computationally. I did an experiment where I basically challenged two species of fungi, Leucoagaricus and Trichoderma, and then sampled at timepoints until the Trichoderma had overgrown the Leucoagaricus. Trichoderma produces tons of green spores so it’s very easy to detect observationally. However, after amplifying the ITS2 region from these DNA samples, sequencing on Miseq, and analyzing using the DADA2 ITS pipeline, my community abundance plots contained only Leucoagaricus and no Trichoderma. This was obviously strange because I knew there should be high abundance of Trichoderma at the late stage timepoints, so I went back to investigate the quality plots and track reads through the pipeline. There was characteristic low quality starting at ~125bp in the forward read and overall low quality in the reverse read associated with Trichoderma abundance. Below is just one example plot, but every sample that contains Trichoderma looks the same, even across multiple experiments and sequencing runs, to the point where I can identify Trichoderma just from looking at the sequencing quality plots.

alt text

There was also higher loss of reads after the filtering step and complete loss of reads after the merging step for samples high in Trichoderma which then results in community abundance plots that only contain Leucoagaricus. Below is a plot showing number of reads following each step of the DADA2 pipeline for a Trichoderma sample.

alt text

Based on the quality plots, I decided to try the pipeline using only the forward reads and trimming them to 125bp to get rid of low quality portions (truncLen = 125 for filterAndTrim()). This resulted in expected levels of Trichoderma. Below are the relative abundance plots from a synthetic community experiment where I pooled different ratios of Leucoagaricus to Trichoderma DNA. You can see how Trichoderma is completely undetected under the normal DADA2 trimming/merging process.

alt text

This very stringent trimming process fixed the problem, however I feel somewhat uncomfortable using only 125 bp in one direction to characterize microbial communities. Outside of these more controlled experiments this method results in many more NA taxonomy calls. It’s also a waste of the rest of the sequencing data we’ve already collected. Do you have any knowledge of this issue or advice for addressing it?

Thank you!

Andreas-Bio commented 4 years ago

I actually don't know enough about dada2 to give you any neat coding advice on noise reduction. Sorry. But for the wise guys to help you it is cruical to show your exact filtering commands.

But since you asked for computational advice: You could try another pipeline that merges first and filters for quality afterwards. Like V/Usearch. Aforementioned software uses both base calls in the overlapping region to determine the best solution if there is a conflict between forw and rev read. However, take results from noisy data always with a grain of salt. It might also be aforementioned suggestion introduces unacceptable levels of noise.

Some non-computational advice that you probably already know but somebody else googling this might find helpful: Using reads only in one direction is actually recommended in fungi studies. read: "Bioinformatics matters: The accuracy of plant and soil fungalcommunity data is highly dependent on the metabarcoding pipeline"

Anyway, it looks like your sequencing had some technical difficulties, as you already mentioned. But before contacting the sequencing company there are some steps to make sure everything is working on your end: 1) Check the GC content with a sliding window of 7bp and check if high GC content correlates with low sequencing quality. 2) Get your sequences (=single species isolations) and do an oldschool Sanger sequencing run before going into the expensive NextGen stuff. This tells you if anything failes/is too difficult BEFORE you go into library prep. 3) Download the ITS2 from GenBank and fold them. Check if any weird loops are positioned right where your sequence quality drops (this might indicate problems during PCR). OR check if your expected loop structure is missing, you might have bad luck and picked a strain with a large number of pseudogenes. This is what it might look like it it's okay: (attachment) grafik