Check the result variation if adding back the frequency gave to the neighbors

josenavas commented 9 years ago

We currently correct only by reducing the neighbor frequency. It is unclear how the results would change if we also added that frequency back to the current sequence.

wasade commented 8 years ago

is this still an issue?

josenavas commented 8 years ago

Right now all the count levels are brought down. I was wondering how the results would be affected if we added the frequency back to the "true" sequences.

wasade commented 7 years ago

Closing as this wasn't something we explored with the manuscript that I'm aware of. I think this is possibly a duplicate of #81. Reopen if necessary.

jcmcnch commented 2 years ago

Hi deblur devs,

I am interested in the possibility of this issue being reopened, as I believe it could improve deblur by generating more accurate relative abundance data. We've previously avoided deblur in favour of DADA2 for our denoising work because of what we believe is the consequence of this issue. To explain a little further why we think it could be a problem: we recently generated amplicon data from the BioGEOTRACES transects (discussed in this paper) for which we had paired metagenomic / amplicon samples. We tried both q2-dada2 and q2-deblur for denoising and compared the relative abundances of taxa between metagenomic SSU rRNA fragments recovered by phyloFlash and amplicon SSU rRNA. In essence we were just making scatterplots with one axis being MG SSU rRNA and the other amplicon SSU rRNA (some merging had to be done to account for the lower taxonomic resolution afforded by the short metagenomic fragments). For 16S, DADA2 consistently gave more accurate correlations between metagenomic and amplicon relative abundances which was a bit puzzling. Looking into it further, we believe the following is happening:

1. For some taxa in our samples (e.g. Prochlorococcus) there is high microdiversity such that some true semi-abundant variants are considered to be sequencing noise by deblur.
2. The sequencing counts for these putative denoising artifacts are not added back to the "parent" sequence, thus reducing the overall abundance of this broader taxonomic group. For taxa that have abundant true variants, this can comprise several percent of the overall # of amplicon reads.
3. This skews the relative abundances for all taxa, resulting in worse correlations to MG SSU rRNA vs. DADA2.
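
To make the arithmetic behind points 2 and 3 concrete, here is a minimal sketch with made-up numbers. The taxon names, read counts, and the number of flagged reads are all hypothetical and are not taken from the BioGEOTRACES data. It only illustrates that dropping reads from one taxon changes the relative abundance of every taxon.

```python
# Toy read counts per taxon (hypothetical, for illustration only).
true_counts = {"Prochlorococcus": 600, "SAR11": 300, "Other": 100}

# Assumption: 90 Prochlorococcus reads belong to true variants
# that the denoiser flags as sequencing noise.
flagged_reads = 90


def relative_abundance(counts):
    """Return each taxon's share of the total read count."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Case A: flagged reads are simply dropped.
dropped = dict(true_counts)
dropped["Prochlorococcus"] -= flagged_reads

# Case B: flagged reads are credited back to the parent taxon.
added_back = dict(dropped)
added_back["Prochlorococcus"] += flagged_reads

rel_true = relative_abundance(true_counts)
rel_dropped = relative_abundance(dropped)
rel_added_back = relative_abundance(added_back)

for taxon in true_counts:
    print(taxon, round(rel_true[taxon], 3),
          round(rel_dropped[taxon], 3), round(rel_added_back[taxon], 3))
```

In this toy case the dropped reads lower the Prochlorococcus share and raise the share of every other taxon, while adding the reads back restores the original proportions at the merged-taxon level.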

While 1) above could be a problem since it may underestimate true sequence diversity, we would be willing to accept this as a potential tradeoff when dealing with noisy data for which deblur seems to consistently outperform DADA2 in terms of removing sequencing noise. However, to the best of my knowledge, 2 & 3 result in data that no longer accurately reflect the true relative abundances of taxa in the sample. While this might be moot if you're doing some sort of log-transform of your data before further processing, it would be a serious issue if you wanted to back out quantitative copy numbers from amplicon data using e.g. an internal standard.
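
As a rough illustration of the internal-standard concern, the sketch below assumes the simplest possible scaling: a known number of standard copies is spiked in, and a taxon's copy number is estimated as its reads multiplied by (standard copies added / standard reads). The function name, the scaling model, and all numbers are assumptions made for this example, not a description of any particular protocol.

```python
def estimate_copies(taxon_reads, standard_reads, standard_copies_added):
    """Scale a taxon's read count by the copies-per-read ratio of the standard.

    Assumes reads are proportional to template copies (a simplification).
    """
    return taxon_reads * standard_copies_added / standard_reads


# Hypothetical values for illustration only.
standard_reads = 50
standard_copies_added = 10_000

# Taxon reads before and after 90 reads are dropped as putative noise.
estimate_full = estimate_copies(600, standard_reads, standard_copies_added)
estimate_dropped = estimate_copies(510, standard_reads, standard_copies_added)

print(estimate_full, estimate_dropped, estimate_dropped / estimate_full)
```

Under this assumption, any reads removed from a taxon but not from the standard reduce the estimated copy number by the same fraction.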

Do you have any thoughts on this? Would this be a trivial thing to implement and at least provide an option to the user to allow choice on how these sequences are dealt with?

Happy to provide more information on this if you think it would help. I have plots and ASV tables that I could share as well as raw sequences.

Thanks for your help, Jesse

wasade commented 2 years ago

Hi Jesse,

Thank you for the inquiry. To be honest, I'm unsure if this would be easy or hard to implement or how this would impact benchmarking.

A few follow up questions, if that's alright:

1. How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
2. Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
3. Are there any ground truth observations here (e.g., mocks, simulated data, etc)?

cc @antgonza

Best, Daniel

jcmcnch commented 2 years ago

Hi Daniel,

Thanks for the quick reply and happy to answer the follow-up questions:

How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
We haven't tried this, but my instinct is that it wouldn't change the results. The pipeline I made basically merges the ASVs into something like 95% OTUs anyway, since the short metagenomic reads will often match more than one ASV if they are closely related.
Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
If both looked equally bad, then maybe you could argue that it's six of one and half a dozen of the other. But since the correlations look so much nicer with DADA2, I don't think we can say that it's just due to inherent differences. DADA2 amplicons are pretty much spot on vs the MG.
Are there any ground truth observations here (e.g., mocks, simulated data, etc)?
Sort of - we were working under the assumption that the MG would be a ground truth for amplicons, since one might naively assume that amplicons are more biased due to PCR. But perhaps someone out there has done a similar test on, say, the Zymo mock communities where they did both MG and amplicon samples. But even then, what would be the ground truth?

So to summarize - I think the issue really boils down to relative abundances being skewed due to the subtraction of putative sequencing noise. Since those subtracted abundances are not added back to the parent, it creates a situation where the quantitative nature of the data is potentially lost. The severity of this issue would probably vary sample-to-sample and may be more acute in some cases versus others and would depend on the properties of the sample.

Thanks again for looking into this, and looking forward to hearing your thoughts.

Best, Jesse

antgonza commented 2 years ago

Thank you both; this is interesting.

I'm not sure exactly what would need to be modified to test this hypothesis; @jcmcnch, do you know? Basically, I'm wondering if it's something already exposed in the CLI, for example one of the options in deblur workflow --help or deblur deblur-seqs --help or something more "code involved".

Anyway, a few more questions:

1. Are you using the same reference database to assign taxonomy (GG vs Silva) to the fragments produced by DADA2 and deblur? What about algorithm? I think you used vsearch for the publication vs. qiime feature-classifier classify-sklearn, right?
2. Do you know the number of different fragments for each protocol? Basically, are you seeing a larger number with DADA2 than deblur?
3. Thinking a bit more about 2, what about the length of the sequences? Looking at the paper, it is not clear if the fwd/rev reads were joined or if you just used the fwd.

amnona commented 2 years ago

Hi Jesse, and thanks for raising this issue :) A few more thoughts:

1. Changing the deblur algorithm to assign the (suspected) error reads back to the non-noise sequences should be easy to implement (i.e., when removing noise reads, instead of just dropping them, add them to the real sequence from which they were judged to be errors). However, this is not a perfect solution (I can think of extreme cases where this naive approach will fail, since deblur is a greedy algorithm and we use an upper bound on the error profile). I think the main difficulty will be in validating whether (and in what cases) this change improves results, and what potential problems it may introduce.
2. Also note that singletons cannot be assigned back (at least it is hard for me to think of a way to do it), as they represent discrete results where statistics are much harder. This is why we throw away singletons even if they don't have any neighbors.
3. It is hard for me to imagine a scenario like the one you described, with such high micro-diversity that it causes a bias when throwing away the suspected noise reads. Maybe the problem is that the deblur noise profile you are using is very aggressive? This usually happens when analyzing long reads (i.e. > 150-200 bp) without trimming, using the default deblur parameters. What read lengths did you use in your experiment? Did you trim the reads?
4. Can you share the fasta reads of a sample and the metatranscriptomics assumed ground truth (and also the dada2 results for this sample)? This way we can try to track what happened to specific sequences during deblur, and see what contributes to the loss of correlation.
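
The "add back" idea in point 1 can be sketched roughly as follows. This is a deliberately simplified greedy pass written for illustration and is not deblur's actual implementation: the function names, the `add_back` switch, and the `error_profile` values are all hypothetical, and it ignores singletons, indels, and the extreme cases mentioned above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def greedy_denoise(counts, error_profile, add_back=False):
    """Simplified greedy denoising pass (illustrative only).

    counts: dict mapping sequence -> read count (sequences of equal length).
    error_profile: hypothetical upper bound on the fraction of a parent's
        reads expected at each Hamming distance (list index = distance).
    add_back: if True, reads removed from a neighbor are credited to the
        parent that caused the removal instead of being dropped.
    """
    remaining = {seq: float(n) for seq, n in counts.items()}
    credited = {seq: 0.0 for seq in counts}
    # Process sequences from most to least abundant.
    order = sorted(counts, key=counts.get, reverse=True)

    for i, parent in enumerate(order):
        if remaining[parent] <= 0:
            continue  # already explained away as noise
        parent_count = remaining[parent]
        for other in order[i + 1:]:
            if remaining[other] <= 0:
                continue
            d = hamming(parent, other)
            if d == 0 or d >= len(error_profile):
                continue  # too distant to be treated as an error of parent
            expected_errors = parent_count * error_profile[d]
            removed = min(expected_errors, remaining[other])
            remaining[other] -= removed
            credited[parent] += removed

    result = {}
    for seq in order:
        if remaining[seq] > 0:
            bonus = credited[seq] if add_back else 0.0
            result[seq] = remaining[seq] + bonus
    return result


# Toy input (hypothetical): one abundant sequence, one close variant,
# and one unrelated sequence.
toy_counts = {"ACGT": 1000, "ACGA": 30, "TTTT": 200}
toy_profile = [1.0, 0.05]  # assume up to 5% of reads at Hamming distance 1

without_add_back = greedy_denoise(toy_counts, toy_profile)
with_add_back = greedy_denoise(toy_counts, toy_profile, add_back=True)
print(without_add_back)
print(with_add_back)
```

With `add_back=True` the total read count is conserved in this sketch, which is the property the relative-abundance argument depends on. Whether crediting reads this way improves accuracy on real data is exactly the validation question raised above.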

cheers, Amnon

biocore / deblur

Check the result variation if adding back the frequency gave to the neighbors #5