benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

Lower diversity with high complexity? #253

Closed adityabandla closed 7 years ago

adityabandla commented 7 years ago

Hi Benjamin

I am working on a sediment 16S dataset (V6-V8; Illumina PE 2x 300bp), using only the forward reads as the quality of the reverse ones are bad.

The number of sequence variants that DADA2 inferred were 3x lower than UNOISE2, which was quite surprising. Error rates were learnt using 9 samples (maxEE 6, length 250bp) and sample inference was done individually.

After soft filtering (minimum counts of 5 across all samples (total 27) and should occur in atleast 2 samples), the number of variants were ~6K, which seem to be bit low for sediments. UNOISE2 gave me 17k variants while UPARSE (97% OTUs) gave me 13k. Metagenome from the same data suggests atleast 10k species/strains, based on single copy genes.

Regards Aditya

benjjneb commented 7 years ago

Depending on your goal here, you may want to consider performing sample inference on the pooled samples (dada(..., pool=TRUE)) rather than per-sample, which is the dada2 default.

Pooled sample inference will be more sensitive to variants that appear at low frequencies across multiple samples: If a variant is present in just 1 or 2 reads in multiple samples, there might not be enough statistical evidence to call it in any individual sample, whereas there would be in the pooled sample. The cost is increased computation time. The biggest benefit of pooling probably is for hyper-diverse samples, where many variants are present at very low frequencies. We also have a brief discussion of pooling on our website.

If your goal is to compare dada2 and unoise2, you should run them in the same way -- either per-sample for both, or pooled for both. The default workflow for unoise2 is to pool everything, so if you want to compare dada2 results to that, you should turn on dada2's pooling option.

adityabandla commented 7 years ago

Hi Benjamin

Fair point on doing the comparison. I avoided pooling the samples as it was taking a lot of time. But probably I will pool the samples and post the results here

Its interesting that a similar trend is discussed here http://fiererlab.org/2017/05/02/lumping-versus-splitting-is-it-time-for-microbial-ecologists-to-abandon-otus/

benjjneb commented 7 years ago

Happy to take a look at what you find!

If you are getting into the weeds of comparing methods, the other thing to be aware of is the differences in the default chimera removal methods between dada2 and unoise2. In particular, unoise2 requires a higher minFoldParentOverAbundance=2, while dada2 has a default of 1. Furthermore, as of 1.4 dada2 performs "consensus" chimera removal (and this is what we recommend) where chimeras are identified in each sample independently and then a consensus decision on whether each SV is a chimera or not based on a "vote" across samples, whereas unoise2 removes chimeras from all samples pooled together.

However, these are just differences in the defaults, and you can make dada2's chimera removal essentially the same as unosie2 by calling removeBimeraDenovo(..., minFoldParentOverAbundance=2, method="pooled").

adityabandla commented 7 years ago

Hi Benjamin

Finally, I have some results. The number of raw sequence variants detected are about the same for both UNOISE2 and DADA2 (individual sample inference)

After soft filtering (min count across all samples = 5, min samples present in = 2), the number of variants from UNOISE2 are ~15k while for DADA2 it is ~7.5k. The rationale for the filtering parameters: I have a total of 27 samples and 3 biological replicates for each treatment

After varying the thresholds a bit, the massive decrease in variants counts, in the DADA2 case was due to the filtering parameter (min samples present in = 2). As you pointed out, this might be due to the lack of statistical power to call these variants in the other samples (in the individual sample inference mode). The number of variants is again similar to UNOISE2, when I remove this filtering parameter, but retain the minimum counts parameter

The pooled sample inference job has now been running for 2d+ (3M unique sequences). In my case, since I am studying some environmental gradients, it makes sense for some of the low frequency variants to be just above the counts threshold for some replicates and not others. The pooling method of UNOISE2 seems to be doing a good job of detecting these trends consistently. I am sure DADA2 would do as well, but is there anyway to make this faster?

Best, Aditya

benjjneb commented 7 years ago

Glad things are working for you!

On the long run -- 3M unique sequences is tractable, I've certainly run datasets like that from fecal samples, but its hard to guess exactly how long it will take because run time goes like N_UNIQUE_SEQUENCES x (N_REAL_VARIANTS + N_CHIMERAS), so it depends on how diverse your community is and how chimera-prone your PCR protocol was.

It's true that DADA2 is not as fast in UNOISE2 on large pooled samples right now, largely for technical reasons related to the heuristics used by each method (DADA2 uses a N-W aligner rather than a seed and extend aligner, DADA2 is not greedy, DADA2 allows unique sequences with abundance 2+ to be real variants instead of 4+ for UNOISE2). It may be worth developing a DADA2 mode that uses more heuristics that gain speed at the cost of a bit of accuracy for people like yourself who want to run pooled large datasets...

Right now the easiest thing to do is to filter more stringently (and make sure you are removing phiX!).