benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Distanced - comments, opinions? #570

Closed GeoMicroSoares closed 5 years ago

GeoMicroSoares commented 5 years ago

Hi there,

The preprint for Distanced - https://www.biorxiv.org/content/biorxiv/early/2018/09/24/423186.full.pdf - has just been released.

The big take-home message from their benchmarking is that DADA2 seems to underestimate diversity and that reads would be better off left uncorrected. Honestly, I'm not too bothered; I'd rather rely on DADA2 to retrieve exact sequences that are reproducible for future analyses, for example.

I haven't looked at the code because I'm not a proper bioinformatician, but this seems like the place to start a discussion on it.

benjjneb commented 5 years ago

Thanks for pointing it out, I wouldn't have seen it otherwise.

At a high level, I like the idea of this paper. One way to think of denoising raw data to ASVs is as a general-purpose intermediate processing step that is useful for lots of different downstream purposes (richness, diversity, distances, differential abundance, computation time, cross-study comparison, etc.). However, if the sole goal is to optimize a single downstream metric, such as the authors here are doing with mean-pairwise-distance (MPD), it is reasonable to try working directly with the raw data and attempting to find an estimator that more accurately estimates the metric of interest than can be achieved by going through the ASV intermediate. In principle, one should be able to do better from the raw data, as there is more information in it than in the derivative ASVs.

A terminology quibble: I think framing their results as being about "diversity" is too general. For example, richness, which many people consider a diversity metric, is orders of magnitude better estimated with ASVs from DADA2 than with the raw reads. Yet when "diversity" or "alpha-diversity" is discussed in most of the intro/discussion, what is being referred to is a specific diversity metric, mean-pairwise-distance. It might help readers to be clearer on that point, especially since I at least haven't seen that metric used very often in microbial sequencing contexts.

I also wonder about the generalization of their method to real datasets in which oracle knowledge of the true composition is unknown. In the non-simulated results here, the known composition of mock communities was used to identify and remove contaminants and chimeras. That can't be done in real samples of unknown composition, and one of the advantages of ASVs over raw reads is that you can more effectively remove chimeras/contaminants. That's important, because I suspect in a lot of cases contaminants and chimeras will skew MPD estimates more than any sequencer error.

I have one major concern with the actual results they are presenting. I'm guessing that a single set of truncation length parameters was used for all the DADA2 processing, and that those lengths were inappropriate for the V3V4 region (too short, so most real sequences failed to merge), the ITS2 region (you shouldn't truncate to a fixed length for ITS), and perhaps the antibody data (I don't know anything about it, but maybe it's long or of variable length?), and that this is responsible for the poor MPD estimates in those datasets. For example, the DADA2 results look quite accurate for the V4 and V4V5 results (see Fig. 3, Fig. S8), but are way off for the V3V4 results in the same figures. That doesn't make sense on its face, but would be explained by using the same truncation length throughout, combined with the fact that the V3V4 amplicon is longer and thus many of the real sequences failed to overlap enough to merge. I can't say for sure, but I strongly suspect that those results are explained by misspecified truncation lengths rather than DADA2 behaving strikingly differently on V3V4 vs. V4V5 amplicons. One way to easily interrogate that would be to look at the fraction of reads successfully merging in each region. My hypothesis is that it is much lower for the V3V4 data than for the (e.g.) V4V5 data. Does that seem possible @thackmann ?
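The arithmetic behind this suspicion can be sketched in a few lines. The amplicon lengths and truncation lengths below are assumptions for illustration (not the values used in the pre-print), and 12 nt is DADA2's default minimum merge overlap:

```python
# Overlap left after truncating paired reads: overlap = truncF + truncR - amplicon_length.
# Truncation lengths tuned for a ~253 bp V4 amplicon will leave V3V4 reads unable to merge.

def merge_overlap(trunc_f, trunc_r, amplicon_len):
    """Bases of overlap remaining between truncated forward/reverse reads."""
    return trunc_f + trunc_r - amplicon_len

# Hypothetical truncation lengths chosen for V4 on a 2x250 run:
trunc_f, trunc_r = 240, 160

# Approximate amplicon lengths (assumptions for illustration):
for region, length in [("V4", 253), ("V4V5", 375), ("V3V4", 465)]:
    overlap = merge_overlap(trunc_f, trunc_r, length)
    status = "merges" if overlap >= 12 else "fails to merge"
    print(f"{region}: {overlap} nt overlap -> {status}")
```

With these numbers, V4 and V4V5 merge comfortably while V3V4 has negative overlap, which would explain V3V4 looking qualitatively different in the same figures.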

There could be a similar issue with the Deblur ITS results. Deblur requires truncation to a fixed length, so the optimal strategy for Deblur on variable length amplicons (usually) is to truncate reads to less than the shortest real sequence in the data to avoid dropouts. (although perhaps the Deblur experts @wasade or @amnona would like to correct me on that?)

thackmann commented 5 years ago

Thank you @GeoMicroSoares and @benjjneb for reading the pre-print and your thoughtful comments. I was at a conference yesterday and was not able to reply until now.

I agree that DADA2 is a useful tool for identifying amplicon sequence variants (ASVs), and we have used it extensively in our lab. It is also pretty good in estimating richness for many datasets.

Despite these strengths, DADA2 sometimes does a poor job of estimating sequence abundances. The error is usually not large for any one ASV, but errors accumulate across ASVs, producing the effect we report in the Distanced pre-print.

We are not the first to find that DADA2 and other tools distort abundances (see Fig. 2D in the Deblur paper). We are just the first to investigate how it impacts estimation of alpha diversity.

@benjjneb raised a number of concerns in the pre-print that are important to address.

Read truncation

We are open to suggestions for how to truncate reads, but we think we made good choices in the pre-print.

For DADA2, we followed the 16S and ITS tutorials. For bacterial 16S rDNA and antibody mRNA, we varied the value of truncLen to preserve at least 20 bp of overlap. For fungal ITS rDNA, we set minLen to 50. Between 94.9% and 99.9% of reads merged, depending on the sequence type and region. More detail is provided in the supplemental datasets, which will be included in the manuscript once published.

For Deblur, we truncated reads to the length of the shortest reference sequence. We discarded any reads that were too short. We did this before Deblur to prevent reads from dropping out.
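That pre-processing step can be sketched as follows; this is an illustrative reimplementation of the approach described above, not Deblur's own code:

```python
def truncate_for_deblur(reads, trim_len):
    """Trim all reads to a fixed length and drop any read shorter than
    trim_len, so that no reads drop out later inside Deblur (a sketch of
    the pre-processing described above)."""
    return [read[:trim_len] for read in reads if len(read) >= trim_len]

# A 5 nt read is discarded; the others are trimmed to a uniform 7 nt:
print(truncate_for_deblur(["ACGTACGT", "ACGTA", "ACGTACG"], 7))
```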

Chimeras and contaminants

We chose to remove chimeras with reference sequences because the method worked for the purpose at hand. This method is at least as well accepted as others. The DADA2 paper used the same method to determine whether “Other” sequences were chimeras (see Supplementary Table 3 in that paper).

The method cannot be applied to real samples, but there are methods that can (e.g., UCHIME). We are considering switching to another method. If we do, it will be part of a revised pre-print.

@benjjneb seems to indicate that no method for chimera removal works as well for raw reads as for ASVs. We would be interested in seeing that comparison. If it has been published, we have overlooked it.

The larger problem with chimeras is that it is impossible to identify them with certainty (see here and here). Bioinformaticians can quibble over which removal method is best, but whether a sequence is a chimera is ultimately a matter of judgment. We need to attack the problem at its source with better PCR technique; we can't rely on better bioinformatics alone.

The issue with removing contaminants is similar. I like the approach used in decontam, but it is still impossible to identify contaminants with certainty. Again, we need to attack the problem at its source. Illumina is working on reducing contamination from demultiplexing errors. Manufacturers of DNA extraction kits need to reduce contamination in reagents. In our experience, the REPLI-g single cell kit is already quite good.

Diversity metrics

Our results focus on a single diversity metric, mean pairwise distance (MPD), in part for simplicity. The goal is not to optimize a single downstream metric. In principle, Distanced could be used to correct many diversity metrics—many metrics are calculated using distances (see here and here). UniFrac, for example, is a measure of beta diversity that is calculated using distances.

We think that MPD is a good metric and should be used more often in microbial sequencing. As mentioned in the pre-print, some ecological relationships are apparent when using MPD but not richness.
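For concreteness, MPD over a set of sequences can be computed as the abundance-weighted mean of pairwise distances. The sketch below uses per-site Hamming distance on aligned, equal-length sequences; it is an illustrative reimplementation, not Distanced's actual code:

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of mismatched sites between two aligned, equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise_distance(seqs, counts):
    """Abundance-weighted mean pairwise distance (MPD) over aligned sequences.
    Each distinct pair (i, j) is weighted by the product of its abundances."""
    num, den = 0.0, 0
    for i, j in combinations(range(len(seqs)), 2):
        weight = counts[i] * counts[j]
        num += weight * hamming(seqs[i], seqs[j])
        den += weight
    return num / den if den else 0.0

# Toy community of three 4 nt variants with abundances 10, 5, and 1:
print(mean_pairwise_distance(["ACGT", "ACGA", "TCGA"], [10, 5, 1]))
```

This makes the point in the pre-print concrete: MPD depends on both the sequences and their abundances, so systematic abundance distortion across ASVs shifts the estimate even when each individual ASV is nearly correct.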

benjjneb commented 5 years ago

@thackmann Any chance you could share the read numbers at each step for the DADA2 processing? I'm baffled by the differences between the V3V4 results (seemingly bad) and the V4 and V4V5 results (seemingly good). In every relevant figure (3/S2/S4/S6/S8) the V3V4 results look qualitatively different than the other regions, which doesn't match my experience with such data.

I see it's from that old Kozich benchmarking study, which makes it extra odd, because if I remember right the different regions in that data were all generated by amplifications from the same mock communities. But they did have a weird sequencing set where the V3V4/V4/V4V5 data were all mixed together in single samples, which could add its own funky complications perhaps...

Anyway, I'd appreciate it if you could share those summary stats as you've piqued my curiosity about this mystery!

thackmann commented 5 years ago

@benjjneb The read numbers at each step are reported in the supplemental datasets. These datasets were not included with the original pre-print. I know they would be helpful, but the journal reviewing the manuscript may not want them released at this juncture. They will appear with the published version of the manuscript, and you may end up seeing them beforehand.

Speaking generally, the read numbers met the general guidelines of the 16S tutorial: “Outside of filtering (depending on how stringent you want to be) there should no step in which a majority of reads are lost.”
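One way to check that guideline is to tabulate the fraction of reads retained at each step and flag any step that loses a majority. The read counts below are hypothetical numbers for illustration only, not the pre-print's data:

```python
# Hypothetical read counts (assumptions for illustration) through a
# DADA2-style pipeline, checked against the tutorial's guideline that
# no single step should lose a majority of reads.
steps = [("input", 10000), ("filtered", 9200), ("denoised", 9000),
         ("merged", 8700), ("non-chimeric", 8500)]

def check_retention(steps):
    """Return (step name, fraction retained vs. previous step, passes >= 50%)."""
    report = []
    for (_, prev), (name, n) in zip(steps, steps[1:]):
        frac = n / prev
        report.append((name, frac, frac >= 0.5))
    return report

for name, frac, ok in check_retention(steps):
    print(f"{name}: {frac:.1%} retained ({'OK' if ok else 'majority lost'})")
```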

For DADA2, I agree that results for the V34 region look qualitatively different from other 16S regions. For Deblur, however, the V34 region looks similar to the other regions. Deblur seems to handle this region better than DADA2 for this dataset.

Though DADA2 produced good estimates of MPD for the V45 region, it did not for the V4 region (see Fig. 5). It produced good estimates of richness for both regions.

As you mentioned, the reads for the V34, V4, and V45 regions were mixed together in single samples. We separated them according to length during merging. This is a straightforward process. You may have done something similar in the DADA2 paper when analyzing Mock1 from run 130403.
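Separating the mixed regions by merged-amplicon length might look something like the sketch below. The length windows are rough assumptions for illustration, not the thresholds used in the pre-print:

```python
def bin_by_region(merged_seqs):
    """Assign merged amplicons to a 16S region by total length.
    The length windows are hypothetical approximations for illustration."""
    bins = {"V4": (250, 256), "V4V5": (370, 380), "V3V4": (460, 470)}
    out = {region: [] for region in bins}
    for seq in merged_seqs:
        for region, (lo, hi) in bins.items():
            if lo <= len(seq) <= hi:
                out[region].append(seq)
                break
    return out

# Toy input: one merged read per region, distinguished by length alone.
result = bin_by_region(["A" * 253, "A" * 375, "A" * 465])
print({region: len(seqs) for region, seqs in result.items()})
```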

benjjneb commented 5 years ago

OK will look again when the full paper comes out. May your reviews be timely and appropriate!