benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

taxonomy annotation #82

Closed naarkhoo closed 8 years ago

naarkhoo commented 8 years ago

I was curious whether you have a specific reason for implementing only RDP taxonomy annotation and not BLAST. People criticize RDP for its lack of sensitivity.

Thanks.

benjjneb commented 8 years ago

It's mostly technical: the naive Bayesian classifier is simple enough that I could reimplement it in a day, thereby avoiding any additional dependencies. Using BLAST would add a dependency on an external program, complicating installation.

However, since everything is in R, it is possible to interface with external tools. The uniquesToFasta function makes this fairly easy for taxonomy assignment. For example:

uniquesToFasta(mySeqTable, "path/to/seq.fa", ids=my.seq.ids)
# Run external tax assignment on seq.fa

At that point you're left with the task of reading the output of the external tax assignment back into R. As long as the output is in tabular format, this is straightforward with read.table, but the specifics depend on the tool in question.
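As a rough sketch of that read-back step, the example below parses a small tab-delimited file with read.table and keeps the best hit per query. The column names follow BLAST's tabular layout (-outfmt 6); the file contents, sequence ids, and reference ids are made up for illustration, and a different tool would need different column names.

```r
# Hypothetical tab-delimited hits standing in for the output of an external tool.
# The rows below are invented for illustration, not real results.
hits_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "sq1\trefA\t100.0\t250\t0\t0\t1\t250\t1\t250\t1e-130\t462",
  "sq1\trefB\t98.8\t250\t3\t0\t1\t250\t1\t250\t1e-125\t446",
  "sq2\trefC\t99.6\t250\t1\t0\t1\t250\t1\t250\t1e-128\t457"
), hits_file)

# Column names of BLAST's 12-column tabular format (-outfmt 6).
cols <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")

hits <- read.table(hits_file, sep = "\t", header = FALSE,
                   col.names = cols, stringsAsFactors = FALSE)

# Sort by query, then by descending bit score, and keep the first row per query.
hits <- hits[order(hits$qseqid, -hits$bitscore), ]
best <- hits[!duplicated(hits$qseqid), ]
```

If the ids passed to uniquesToFasta were used as FASTA headers, the qseqid column can then be used to join the best hits back onto the sequence table.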

naarkhoo commented 8 years ago

Thanks - I like the idea of "RSV", but I think the current taxonomy information is not enough to understand and study all these variants.

If I understand correctly, the authors argue that 97% is an arbitrary threshold that might group variants that are biologically different; on the other hand, RSVs can be more than 97% similar. But the question/challenge is that, given the current taxonomy information, we can't really distinguish and annotate these RSVs, since both variants would have identical annotations. I wonder if you have any comments on that, including ideas for the wet lab.

spholmes commented 8 years ago

There may be hope in using tools other than BLAST and RDP for taxonomy assignment. A new paper in Bioinformatics describes a tool called PROTAX (http://bioinformatics.oxfordjournals.org/content/early/2016/06/11/bioinformatics.btw346.abstract), which appears to take a careful, probabilistic approach.

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

benjjneb commented 8 years ago

@naarkhoo

I agree that nearby variants often will not be differentiated in our current bacterial taxonomy. However, we can still observe biological signals related to those variants. Perhaps there are two strains of a bacterium in my samples, only one of which is related to poor health outcomes.

To be more concrete, the 16S rRNA sequences of some EHEC E. coli strains are within 3% of non-pathogenic E. coli strains, and thus would be grouped into one OTU, but they are distinguishable at high resolution (e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4172979/).

They may both be classified as E. coli, but I would want to be able to distinguish the one causing hemorrhagic diarrhea if it's possible!
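To make the arithmetic behind the 97% threshold concrete, here is a minimal sketch using two made-up, equal-length sequences that differ at two positions. The sequences, the 250-nt length, and the variable names are illustrative assumptions, and identity is computed as a simple position-by-position match rather than by a real alignment.

```r
# Two hypothetical 250-nt sequences (not real 16S data).
a <- rep(c("A", "C", "G", "T"), length.out = 250)
b <- a
b[c(40, 180)] <- "A"   # positions 40 and 180 are "T" in a, so both now differ

# Fraction of positions that match: 248 / 250 = 0.992
pct_identity <- mean(a == b)

# Above the 97% threshold, so OTU clustering would merge the two sequences...
same_otu <- pct_identity >= 0.97

# ...but they are not identical, so exact sequence variants keep them apart.
same_variant <- identical(a, b)
```

Here same_otu is TRUE while same_variant is FALSE, which is the situation described above: one OTU, two distinguishable variants.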

joey711 commented 8 years ago

awesome example 👍

joey711 commented 8 years ago

(I would also prefer to distinguish garden variety E. coli from hemorrhagic diarrhea)

naarkhoo commented 8 years ago

@joey711 I totally agree; my main point was that even if dada2 finds these variants, they are very difficult to annotate with the current taxonomy databases. Thanks again.

benjjneb commented 8 years ago

The new assignSpecies function (http://benjjneb.github.io/dada2/species.html) uses exact matching to identify genus-species binomial names, and may in part address the issue raised here.
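The idea of exact matching can be illustrated in base R. The sketch below is not how assignSpecies is implemented and does not use dada2; the reference sequences, the binomial names, and the queries are all invented placeholders.

```r
# Hypothetical reference sequences, named by placeholder genus-species binomials.
refs <- c("ACGTACGTACGT", "ACGTACGTACGA")
names(refs) <- c("Genusone alpha", "Genusone beta")

# Hypothetical query sequences: the first matches a reference exactly,
# the second differs from both references at the last base.
queries <- c("ACGTACGTACGA", "ACGTACGTACGG")

# match() returns the index of the identical reference, or NA if there is none.
species <- names(refs)[match(queries, refs)]
```

With exact matching, a query is either assigned the name of an identical reference or left as NA; a single mismatch is enough to leave it unassigned.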