Closed wasade closed 11 years ago
Yes I think this will be generally useful. Thanks!
On Nov 6, 2013, at 8:31 PM, Daniel McDonald notifications@github.com wrote:
New methods for determining rare and unique taxa from a BIOM TaxonTable (e.g., from summarize_taxa.py). New tree objects were required since cogent is GPL, but this fell together pretty well.
@JWDebelius, I think this should be good to go for your uses. The tables are optionally filtered, as filtering out the rare/unique taxa from a BIOM table does add a fair amount of overhead. See the main block for an example of how to use the code, and of course do not hesitate to send questions.
@gregcaporaso @rob-knight, this may be generally useful for QIIME as well. It's threshold based, and is not doing any stats, but the framework is a much more natural way to deal with taxonomy in these datasets. The tree could easily be further annotated as well. The central idea, given some tax strings:
tax_strings = ["kfoo; pbar; c123", "kfoo; pbar; c", "kfoo; pbar; c456", "kfoo; pother; c789"]
The method constructs the tree:
(((c123,c456)pbar,(c789)pother)kfoo)root
Taking this a little further, the method accepts taxon strings per sample, which allows you to annotate the tree to determine how many samples a particular node was observed in. It handles unclassified samples cleanly as well. It does not currently check for contested Greengenes groups, though that could be added easily.
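For readers following along, here is a minimal sketch of the idea described above, not the code from this pull request. The `TaxNode` class, the `build_tree` and `to_newick` helpers, the sample ids, and the rule that a bare rank prefix such as `c` counts as unclassified are all assumptions made for illustration. It relies on dicts preserving insertion order (Python 3.7+).

```python
class TaxNode:
    """Hypothetical minimal node; not the tree class from this pull request."""

    def __init__(self, name):
        self.name = name
        self.children = {}    # child name -> TaxNode (insertion-ordered)
        self.samples = set()  # ids of samples in which this node was seen


def is_unclassified(level):
    # Assumption: a bare rank prefix such as "c" means "unclassified here".
    return len(level) <= 1


def build_tree(tax_strings_by_sample):
    """Build one shared tree from {sample_id: [tax_string, ...]}."""
    root = TaxNode("root")
    for sample_id, tax_strings in tax_strings_by_sample.items():
        for tax_string in tax_strings:
            node = root
            for level in tax_string.split(";"):
                level = level.strip()
                if is_unclassified(level):
                    break  # stop descending at the first unclassified rank
                if level not in node.children:
                    node.children[level] = TaxNode(level)
                node = node.children[level]
                node.samples.add(sample_id)
    return root


def to_newick(node):
    """Serialize the tree with internal node labels, without a trailing ';'."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(child) for child in node.children.values())
    return "(%s)%s" % (inner, node.name)


# Hypothetical sample ids; the taxon strings are the ones from the example above.
table = {
    "sample_a": ["kfoo; pbar; c123", "kfoo; pbar; c"],
    "sample_b": ["kfoo; pbar; c456", "kfoo; pother; c789"],
}
root = build_tree(table)
print(to_newick(root))
pbar = root.children["kfoo"].children["pbar"]
print(len(pbar.samples))  # number of samples in which pbar was observed
```

With these inputs the printed tree matches the one shown above, and the per-node `samples` sets give the kind of count a rare/unique threshold could be applied to.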
You can merge this Pull Request by running
git pull https://github.com/wasade/American-Gut taxtree
Or view, comment on, or merge it at:
https://github.com/qiime/American-Gut/pull/26
@adamrp @ElDeveloper Can you review please and merge if sane?
Thanks for the comments, will address. Not sure if it's appropriate right now to have a script interface; the use of argv is just a simplistic example. But I can gut it if you think it'd be good to do so.
Thanks Daniel, I think this looks really good. Minor comments mostly. Not sure how best to get around that list lookup, although I do think it could end up being a time-sink for very large trees.
Only for nodes with a large number of children, which of course could happen but should be niche. A set/dict would outperform the list once there are more than around 3-5 children. That will happen in taxonomy, but the hierarchies are pretty small. For fasttree and phylogeny, the vast bulk of the nodes should be bifurcating.
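To make the trade-off concrete, here is a rough sketch of the two lookup styles under discussion. Both classes are hypothetical and are not the node class in this pull request; the 3-5 children crossover is the estimate given above and is not measured here.

```python
class ListNode:
    """Hypothetical node that finds a child by scanning a list."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def get_child(self, name):
        # Linear scan: cost grows with the number of children.
        for child in self.children:
            if child.name == name:
                return child
        return None


class IndexedNode:
    """Hypothetical node that also keeps a name -> child dict."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self._index = {}

    def add_child(self, child):
        self.children.append(child)
        self._index[child.name] = child  # extra memory per child

    def get_child(self, name):
        # Hash lookup: roughly constant time regardless of child count.
        return self._index.get(name)


# Both styles return the same child; only the lookup cost differs.
for node_type in (ListNode, IndexedNode):
    parent = node_type("pbar")
    for child_name in ("c123", "c456"):
        parent.add_child(node_type(child_name))
    print(node_type.__name__, parent.get_child("c456").name)
```

For mostly bifurcating trees the list scan touches at most two children, so the extra dict per node may not pay for itself.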
@ElDeveloper @adamrp safe to merge?
:+1:
thanks