benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

From 1.4.0 to 1.9.3 - different results? #610

Closed: leasi closed this issue 5 years ago

leasi commented 5 years ago

Hello!

It has been a while since I used DADA2, and I recently updated it directly from version 1.4.0 to 1.9.3. From the changelog, I understood there were some differences and bug fixes which, however, should not impact the results too much.

Analyzing the same Illumina run (V3-V4 amplicons) with the two versions, I noticed some variations in the ASV reference sequences, the number of ASVs, and the total number of reads assigned in the ASV table. Overall, version 1.9.3 generates more ASVs with fewer reads.

I had a look at what had changed in terms of parameters, so I could try to revert to what would be the 1.4 default equivalents (e.g. minOverlap=20, GAPLESS=FALSE, GREEDY=FALSE, MIN_PREVALENCE=2, OMEGA_C=0). However, there is not that big a difference in the output, making me think that the new parameters and changed default values are not causing the variations I see between versions.
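For reference, the reverted settings looked roughly like this (a minimal sketch, not my exact script: GAPLESS, GREEDY and OMEGA_C are dada() options settable via setDadaOpt(), minOverlap is a mergePairs() argument, and the file paths are placeholders):

```r
library(dada2)

# Revert denoising defaults toward their 1.4-era equivalents:
# OMEGA_C=0 restores the old "correct everything" behavior,
# GAPLESS/GREEDY switch off the newer alignment/clustering speedups
setDadaOpt(GAPLESS = FALSE, GREEDY = FALSE, OMEGA_C = 0)

filtF <- "filtered_F.fastq.gz"  # placeholder paths to filtered reads
filtR <- "filtered_R.fastq.gz"
errF <- learnErrors(filtF, multithread = TRUE)
errR <- learnErrors(filtR, multithread = TRUE)
ddF <- dada(filtF, err = errF, multithread = TRUE)
ddR <- dada(filtR, err = errR, multithread = TRUE)

# minOverlap=20 matches the old default for merging read pairs
merged <- mergePairs(ddF, filtF, ddR, filtR, minOverlap = 20)
```

(MIN_PREVALENCE, as quoted above, would be set the same way via setDadaOpt() if your version exposes it.)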

The thing is that we complement our analysis by identifying specific SNVs in the ASV reference sequences assigned to the genus Bifidobacterium, to try to distinguish between Bifidobacterium species (we know the limitations imposed by amplification and sequencing errors, but this approach gives very good results on in silico and in vivo mock communities).

Upgrading from 1.4.0 to 1.9.3 changes some ASV reference sequences, and therefore has an impact on our complementary analysis.

While I understand that the updates to the core algorithm may induce these changes between versions, could you explain what, in your opinion, is the origin of these differences, and whether we can rely on the newly generated results?

Attached are the ASV_ref_seqs.fasta files from both versions; do not hesitate to ask if you need any complementary information!

ASV_ref_seqs_1.4.0.txt ASV_ref_seqs_1.9.4.txt

Thanks for your help :)

benjjneb commented 5 years ago

Analyzing the same Illumina run (V3-V4 amplicons) with the two versions, I noticed some variations in the ASV reference sequences, the number of ASVs, and the total number of reads assigned in the ASV table. Overall, version 1.9.3 generates more ASVs with fewer reads.

That sounds about in line with expectations. The fewer reads are because of the change in error-correction defaults from "correct everything" to "correct everything above a likelihood threshold" set by OMEGA_C, and the increase in the number of ASVs is probably due to the slightly more conservative chimera-detection defaults now implemented, which will be most noticeable on multi-V-region data.

While I understand that the updates to the core algorithm may induce these changes between versions, could you explain what, in your opinion, is the origin of these differences, and whether we can rely on the newly generated results?

You've identified several of the relevant default parameter changes, but my guess would be that in your case the chimera-removal changes had the biggest effect, and led to more ASVs in the final output because fewer ASVs were flagged as chimeric. We think the new defaults better balance false-positive and false-negative chimera identification, with the biggest positive impact being less false-flagging of intra-genomic variants as potential chimeras. This would suggest that in the 1.9.4 data you are seeing almost all the Bifidobacterium ASVs you saw in the 1.4.0 output, plus some more that aren't removed as chimeras anymore. Is that roughly in line with what you are seeing?
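If you want to check that directly, a sketch along these lines (assuming seqtab is your merged sequence table) cross-tabulates which ASVs the old and new chimera defaults flag:

```r
library(dada2)

# seqtab: merged sequence table (samples x ASVs)
# Flag chimeras with the current, more conservative defaults...
bim_new <- isBimeraDenovoTable(seqtab)
# ...and with the older, more aggressive settings
bim_old <- isBimeraDenovoTable(seqtab, minFoldParentOverAbundance = 1,
                               allowOneOff = TRUE)

# The old=TRUE/new=FALSE cell counts the ASVs that the old defaults
# removed but the new defaults keep ("rescued" ASVs)
table(old = bim_old, new = bim_new)
rescued <- colnames(seqtab)[bim_old & !bim_new]
```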

As for which output is better, we only implement changes if we think they improve accuracy, or improve something else (e.g. speed) without impinging on accuracy. However, we can't test every possible situation, so you are asking good questions!

leasi commented 5 years ago

Thank you for your answer! I just performed a test without the chimera-removal step and indeed, the results are much more similar between the two versions, so your hunch was right. I also tried changing the chimera-filtering method from consensus to per-sample, with little impact on the results, so it seems that parameter is not involved.
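Concretely, the test was along these lines (a sketch; seqtab is the merged sequence table from each version, and skipping chimera removal just means using seqtab as-is):

```r
library(dada2)

# Chimera removal with the consensus (default) vs per-sample methods
seqtab_consensus <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)
seqtab_persample <- removeBimeraDenovo(seqtab, method = "per-sample", verbose = TRUE)

# Proportion of ASVs and of reads retained by each strategy
ncol(seqtab_consensus) / ncol(seqtab)
sum(seqtab_consensus) / sum(seqtab)
```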

Now that you've helped identify the step with the biggest impact, I'm still wondering about those "former-chimera-now-legit" ASVs generated after this change in chimera filtering. Indeed, for some of those that are now identified as Bifidobacterium (e.g. sequences 54 and 62 in the previously attached ASV_ref_seqs_1.9.4.txt), we can't find our SNV signatures in their representative sequences. Interestingly, this phenomenon does not occur on data from our mock community (a simple mix of several Bifidobacterium species we want to identify), presumably because there is less chance of chimera formation in such a clean, low-complexity, lab-generated sample?

In order to evaluate the new ASV sequences in our real datasets from infant gut, my colleague came up with this idea: knowing that Bifidobacterium species from the infant gut are very well characterized in 16S rRNA gene sequence databases, we should be able to retrieve our Bifidobacterium ASV sequences from such a database with high similarity. This would allow us to "evaluate" the ASV reference sequences independently of our own SNV-identification method.

So I blasted the Bifidobacterium ASV reference sequences against NCBI 16S, and compared the similarities between our sequences and the best hit (a quick-and-dirty approach) for DADA2 1.4.0 and 1.9.3. The results are summarized in the attached figure; hopefully it is clear enough.

[Figure bif_similarity: similarity of Bifidobacterium ASVs to their best NCBI 16S hit, DADA2 1.4.0 vs 1.9.3]
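For completeness, the quick-and-dirty BLAST step looked roughly like this (a sketch: it assumes BLAST+ is installed with a local copy of NCBI's 16S_ribosomal_RNA database, and that bif_seqs holds the Bifidobacterium ASV sequences):

```r
library(dada2)

# bif_seqs: character vector of the Bifidobacterium ASV sequences.
# uniquesToFasta() expects an integer vector named by sequence,
# so give every ASV a dummy abundance of 1
uniquesToFasta(setNames(rep(1L, length(bif_seqs)), bif_seqs),
               fout = "bif_asvs.fasta",
               ids = paste0("ASV", seq_along(bif_seqs)))

# Best hit per ASV (note: -max_target_seqs 1 keeps the first hit
# found, which is good enough for a rough similarity check)
system(paste("blastn -query bif_asvs.fasta -db 16S_ribosomal_RNA",
             "-outfmt '6 qseqid sseqid pident length'",
             "-max_target_seqs 1 -out bif_hits.tsv"))
hits <- read.table("bif_hits.tsv",
                   col.names = c("asv", "hit", "pident", "len"))
summary(hits$pident)
```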

The thing that puzzles me is that the ASV reference sequences newly generated in 1.9.3 (in yellow) seem noisier, since they show less similarity to the reference sequences (confirming what we observe with our SNV-signature method). This makes me think that those sequences (potentially previously filtered out as chimeras) contain more errors, but perhaps I'm interpreting those results wrongly? What would be your take on such observations?

And thank you again so much for your time :)

benjjneb commented 5 years ago

The thing that puzzles me is that the ASV reference sequences newly generated in 1.9.3 (in yellow) seem noisier, since they show less similarity to the reference sequences (confirming what we observe with our SNV-signature method). This makes me think that those sequences (potentially previously filtered out as chimeras) contain more errors, but perhaps I'm interpreting those results wrongly? What would be your take on such observations?

That sounds like about what I would expect. In classification problems there is always a balance between false positives and false negatives. The change we made to chimera classification made it more stringent, i.e. fewer chimeras are identified. We think this was a net benefit, because while some more real chimeras will get through, fewer true variants are being flagged and removed. But it is not unexpected that the now-unflagged set of ASVs would have a higher rate of chimeras in it, hence your finding that it is somewhat "noisier", at least in comparison to a reference db.

leasi commented 5 years ago

Makes total sense, thanks for your confirmation! Could the stringency of the chimera-classification step be adjusted on the user side via a parameter, or is that not possible because the changes between versions affect the core chimera-detection algorithm itself?

benjjneb commented 5 years ago

You can adjust them. The most important parameter is minFoldParentOverAbundance, which was changed from 1 to 2 back in 1.6 (if I remember right). We also changed allowOneOff from TRUE to FALSE at the same time. The method is also important, but I definitely recommend method="consensus" (the default) unless you are pooling for sample inference, in which case use method="pooled".
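For example, to approximate the pre-1.6 behavior on a merged sequence table (a sketch, with seqtab standing in for your sequence table):

```r
library(dada2)

# Approximate the pre-1.6 chimera-removal defaults
seqtab_oldstyle <- removeBimeraDenovo(seqtab,
                                      method = "consensus",
                                      minFoldParentOverAbundance = 1,
                                      allowOneOff = TRUE,
                                      verbose = TRUE)
```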

leasi commented 5 years ago

Indeed, reverting allowOneOff to TRUE gives results nearly identical to 1.4.0! (Changing minFoldParentOverAbundance back to 1 has little impact in comparison.) Thank you for helping me track down the root cause of those differences :)