Closed josenavas closed 7 years ago
is this still an issue?
Right now all the count levels are brought down. I was wondering how it would affect the results if we corrected the frequency back to the "true" sequences.
Closing as this wasn't something we explored with the manuscript that I'm aware of. I think this is possibly a duplicate of #81. Reopen if necessary.
Hi deblur devs,
I am interested in the possibility of this issue being reopened, as I believe it could improve deblur by generating more accurate relative abundance data. We've previously avoided deblur in favour of DADA2
for our denoising work because of what we believe is the consequence of this issue. To explain a little further why we think it could be a problem: we recently generated amplicon data from the BioGEOTRACES transects (discussed in this paper) for which we had paired metagenomic / amplicon samples. We tried both q2-dada2 and q2-deblur for denoising and compared the relative abundances of taxa between metagenomic SSU rRNA fragments recovered by phyloFlash and amplicon SSU rRNA. In essence we were just making scatterplots with one axis being MG SSU rRNA and the other amplicon SSU rRNA (some merging had to be done to account for the lower taxonomic resolution afforded by the short metagenomic fragments). For 16S, DADA2 consistently gave more accurate correlations between metagenomic and amplicon relative abundances which was a bit puzzling. Looking into it further, we believe the following is happening:
deblur.
DADA2.
While 1) above could be a problem since it may underestimate true sequence diversity, we would be willing to accept this as a potential tradeoff when dealing with noisy data for which deblur seems to consistently outperform DADA2 in terms of removing sequencing noise. However, to the best of my knowledge, 2 & 3 result in data that no longer accurately reflect the true relative abundances of taxa in the sample. While this might be moot if you're doing some sort of log-transform of your data before further processing, it would be a serious issue if you wanted to back out quantitative copy numbers from amplicon data using e.g. an internal standard.
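For readers who want to reproduce the kind of comparison described above (metagenomic vs. amplicon relative abundances per taxon), here is a minimal sketch. It is not the analysis code used for the BioGEOTRACES data; the function names and the toy counts are hypothetical, and it assumes both tables have already been collapsed to the same taxonomic level.

```python
from math import sqrt


def relative_abundance(counts):
    """Convert a {taxon: count} mapping to {taxon: fraction of total}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no counts to normalize")
    return {taxon: n / total for taxon, n in counts.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)


def compare(metagenome_counts, amplicon_counts):
    """Pair up relative abundances per taxon (x: metagenome, y: amplicon)."""
    mg = relative_abundance(metagenome_counts)
    amp = relative_abundance(amplicon_counts)
    taxa = sorted(set(mg) | set(amp))
    xs = [mg.get(t, 0.0) for t in taxa]  # taxa missing from a table count as 0
    ys = [amp.get(t, 0.0) for t in taxa]
    return taxa, xs, ys, pearson(xs, ys)


# Hypothetical toy tables, already merged to a shared taxonomic level.
taxa, xs, ys, r = compare(
    {"TaxonA": 50, "TaxonB": 30, "TaxonC": 20},
    {"TaxonA": 450, "TaxonB": 350, "TaxonC": 200},
)
print(taxa, round(r, 3))
```

The `xs`/`ys` pairs are what would go on the two axes of the scatterplot; running the same comparison once per denoiser gives one correlation per method.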
Do you have any thoughts on this? Would this be a trivial thing to implement and at least provide an option to the user to allow choice on how these sequences are dealt with?
Happy to provide more information on this if you think it would help. I have plots and ASV tables that I could share as well as raw sequences.
Thanks for your help, Jesse
Hi Jesse,
Thank you for the inquiry. To be honest, I'm unsure if this would be easy or hard to implement or how this would impact benchmarking.
A few follow-up questions, if that's alright:
cc @antgonza
Best, Daniel
Hi Daniel,
Thanks for the quick reply and happy to answer the follow-up questions:
So to summarize - I think the issue really boils down to relative abundances being skewed due to the subtraction of putative sequencing noise. Since those subtracted abundances are not added back to the parent, it creates a situation where the quantitative nature of the data is potentially lost. The severity of this issue would probably vary from sample to sample, depending on the properties of the sample.
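To make the arithmetic concrete, here is a toy illustration. All numbers are hypothetical, and it is a simplification rather than a model of what deblur actually does: it simply assumes that two sequences present in equal amounts lose different fractions of their reads to noise.

```python
def relative_abundance(counts):
    """Convert a {sequence: count} mapping to {sequence: fraction of total}."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


# Hypothetical sample: A and B are truly present at 1000 molecules each,
# but A loses 10% of its reads to sequencing noise and B only 2%.
exact_reads = {"A": 900, "B": 980}  # reads matching the true sequence
noise_reads = {"A": 100, "B": 20}   # noise reads derived from each parent

# Noise reads are discarded and not credited to the parent.
dropped = relative_abundance(exact_reads)

# Noise reads are added back to the parent they came from.
added_back = relative_abundance(
    {seq: exact_reads[seq] + noise_reads[seq] for seq in exact_reads}
)

print("dropped:   ", {s: round(f, 4) for s, f in dropped.items()})
print("added back:", {s: round(f, 4) for s, f in added_back.items()})
```

With these made-up numbers, dropping the noise reads gives A about 47.9% instead of the true 50%, while adding them back recovers 50/50. If both sequences lost the same fraction of reads, the two approaches would give identical relative abundances, which is why the effect should depend on the sample.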
Thanks again for looking into this, and looking forward to hearing your thoughts.
Best, Jesse
Thank you both; this is interesting.
I'm not sure exactly what would need to be modified to test this hypothesis; @jcmcnch, do you know? Basically, I'm wondering if it's something already exposed in the CLI, for example one of the options in deblur workflow --help or deblur deblur-seqs --help or something more "code involved".
Anyway, a few more questions:
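- Are you using the same reference database to assign taxonomy (GG vs Silva) to the fragments produced by DADA2 and deblur? What about the algorithm? I think you used vsearch for the publication vs. qiime feature-classifier classify-sklearn, right?
- Do you know the number of different fragments for each protocol? Basically, are you seeing a larger number with DADA2 than deblur?
- Thinking a bit more about 2, what about the length of the sequences? Looking at the paper, it is not clear if the fwd/rev reads were joined or if you just used the fwd.
Hi Jesse, and thanks for raising this issue :) A few more thoughts:
1. Changing the deblur algorithm to assign the (suspected) error reads back to the non-noise sequences should be easy to implement (i.e., when removing noise reads, instead of just dropping them, add them to the real sequence from which they were judged to be errors). However, this is not a perfect solution (I can think of extreme cases where this naive approach will fail, since deblur is a greedy algorithm and we use an upper bound on the error profile). I think the main difficulty will be in validating whether (and in what cases) this change improves results, and what potential problems it may introduce.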
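As a rough illustration of the naive change described above, here is a minimal sketch of a greedy denoiser with an optional add-back step. It is not deblur's implementation: the single `error_rate` upper bound, the Hamming-distance-1 neighbourhood, and the function names are all simplifying assumptions made for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_denoise(counts, error_rate=0.05, add_back=False):
    """Toy greedy denoiser (hypothetical, NOT deblur's actual algorithm).

    Sequences are visited from most to least abundant. Each one subtracts an
    upper bound on its expected noise (error_rate * its count) from every
    less abundant sequence at Hamming distance 1. With add_back=True, the
    reads removed from a neighbour are credited to the sequence that
    caused the subtraction instead of being dropped.
    """
    remaining = dict(counts)
    order = sorted(counts, key=counts.get, reverse=True)
    for i, parent in enumerate(order):
        parent_count = remaining[parent]
        if parent_count <= 0:
            continue  # already removed as noise
        expected_noise = error_rate * parent_count
        for other in order[i + 1:]:
            if remaining[other] <= 0 or hamming(parent, other) != 1:
                continue
            removed = min(expected_noise, remaining[other])
            remaining[other] -= removed
            if add_back:
                remaining[parent] += removed
    return {seq: n for seq, n in remaining.items() if n > 0}


# Hypothetical counts: ACGA is one mismatch away from the abundant ACGT.
counts = {"ACGT": 1000, "ACGA": 40, "TTTT": 500}
print(greedy_denoise(counts))                 # noise reads dropped
print(greedy_denoise(counts, add_back=True))  # noise reads credited to ACGT
```

With add-back, the total read count is conserved in this example. The sketch also shows one way the naive approach can go wrong: if ACGA were a real low-abundance variant rather than noise, its reads would be wrongly credited to ACGT, because the subtraction uses an upper bound on the expected error.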
cheers, Amnon
On Wed, Dec 1, 2021 at 12:58 AM Antonio Gonzalez @.***> wrote:
Thank you both; this is interesting.
I'm not sure exactly what would need to be modified to test this hypothesis; @jcmcnch https://github.com/jcmcnch, do you know? Basically, I'm wondering if it's something already exposed in the CLI, for example one of the options in deblur workflow --help or deblur deblur-seqs --help or something more "code involved".
Anyway, a few more questions:
- Are you using the same reference database to assign taxonomy (GG vs Silva) to the fragments produced by DADA2 and deblur? What about the algorithm? I think you used vsearch for the publication vs. qiime feature-classifier classify-sklearn, right?
- Do you know the number of different fragments for each protocol? Basically, are you seeing a larger number with DADA2 than deblur?
- Thinking a bit more about 2, what about the length of the sequences? Looking at the paper, it is not clear if the fwd/rev reads were joined or if you just used the fwd.
We only correct by reducing the neighbor's frequency; it's unclear how the results would be affected if we also increased the frequency of the current sequence.