biocore / American-Gut

American Gut open-access data and IPython notebooks

methods for fetching unique/rare taxa #26

Closed wasade closed 11 years ago

wasade commented 11 years ago

New methods for determining rare and unique taxa from a BIOM TaxonTable (e.g., from summarize_taxa.py). New tree objects were required since cogent is GPL, but this fell together pretty well.

@JWDebelius, I think this should be good to go for your uses. The tables are optionally filtered, as filtering out the rare/unique taxa from a BIOM table does add a fair amount of overhead. See the main block for an example of how to use the code, and of course do not hesitate to send questions.

@gregcaporaso @rob-knight, this may be generally useful for QIIME as well. It's threshold based, and is not doing any stats, but the framework is a much more natural way to deal with taxonomy in these datasets. The tree could easily be further annotated as well. The central idea, given some tax strings:

tax_strings = ["k__foo; p__bar; c__123",
               "k__foo; p__bar; c__",
               "k__foo; p__bar; c__456",
               "k__foo; p__other; c__789"]

The method constructs the tree:

(((c__123,c__456)p__bar,(c__789)p__other)k__foo)root

Taking this a little further, the method accepts taxon strings per sample, which allows you to annotate the tree to determine how many samples a particular node was observed in. It handles unclassified samples cleanly as well. It does not currently check for contested Greengenes groups, though that could be added easily.
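To illustrate the idea described above, here is a minimal sketch of building such a tree from per-sample taxon strings and recording which samples each node was observed in. It is not the code from this pull request: `TaxNode`, `build_tree`, and the sample ids are hypothetical names, and it assumes that a level with an empty name (e.g. `c__`) means "unclassified from here down".

```python
class TaxNode:
    """Hypothetical minimal node; not the tree class added in this pull request."""

    def __init__(self, name):
        self.name = name
        self.children = {}    # child name -> TaxNode
        self.samples = set()  # ids of samples in which this node was observed

    def to_newick(self):
        """Return a Newick-like string with internal node labels."""
        if not self.children:
            return self.name
        inner = ",".join(c.to_newick() for c in self.children.values())
        return "(%s)%s" % (inner, self.name)


def build_tree(strings_by_sample):
    """Build a taxonomy tree from {sample_id: [tax_string, ...]}."""
    root = TaxNode("root")
    for sample_id, tax_strings in strings_by_sample.items():
        for tax_string in tax_strings:
            node = root
            for level in (part.strip() for part in tax_string.split(";")):
                # Assumption: an empty name such as "c__" marks an
                # unclassified level, so stop descending there.
                if not level or level.endswith("__"):
                    break
                if level not in node.children:
                    node.children[level] = TaxNode(level)
                node = node.children[level]
                node.samples.add(sample_id)
    return root


strings_by_sample = {
    "sample_a": ["k__foo; p__bar; c__123", "k__foo; p__bar; c__"],
    "sample_b": ["k__foo; p__bar; c__456", "k__foo; p__other; c__789"],
}
root = build_tree(strings_by_sample)
print(root.to_newick())
p_bar = root.children["k__foo"].children["p__bar"]
print(len(p_bar.samples))  # number of samples in which p__bar was observed
```

With counts like `len(node.samples)` on every node, a threshold-based rule for "rare" or "unique" taxa becomes a simple comparison per node; the actual thresholds used in the pull request are not shown in this thread.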

rob-knight commented 11 years ago

Yes I think this will be generally useful. Thanks!

On Nov 6, 2013, at 8:31 PM, Daniel McDonald <notifications@github.com> wrote:

You can merge this Pull Request by running

git pull https://github.com/wasade/American-Gut taxtree

Or view, comment on, or merge it at:

https://github.com/qiime/American-Gut/pull/26


wasade commented 11 years ago

@adamrp @ElDeveloper Can you review please and merge if sane?

wasade commented 11 years ago

Thanks for the comments, will address. Not sure if it's appropriate right now to have a script interface; the use of argv is just a simplistic example. But I can gut it if you think it'd be good to do so.

adamrp commented 11 years ago

Thanks Daniel, I think this looks really good. Minor comments mostly. Not sure how best to get around that list lookup, although I do think it could end up being a time-sink for very large trees.

wasade commented 11 years ago

Only for nodes with a large number of children, which of course could happen but should be niche. A set/dict would outperform the list once there were more than around 3-5 children. This will happen in taxonomy, but the hierarchies are pretty small. For FastTree and phylogeny, the vast bulk of the nodes should be bifurcating.
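The trade-off discussed here can be sketched as follows. Both classes are hypothetical illustrations rather than the code under review: one keeps children in a list and scans it by name, the other keeps them in a dict keyed by name.

```python
class ListNode:
    """Hypothetical node that stores children in a list."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def get_child(self, name):
        # Linear scan: cost grows with the number of children.
        for child in self.children:
            if child.name == name:
                return child
        return None


class DictNode:
    """Hypothetical node that stores children in a dict keyed by name."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def add_child(self, child):
        self.children[child.name] = child

    def get_child(self, name):
        # Hash lookup: roughly constant cost regardless of child count.
        return self.children.get(name)


list_root = ListNode("k__foo")
dict_root = DictNode("k__foo")
for phylum in ("p__bar", "p__other"):
    list_root.add_child(ListNode(phylum))
    dict_root.add_child(DictNode(phylum))

print(list_root.get_child("p__other").name)
print(dict_root.get_child("p__other").name)
```

For bifurcating nodes the list scan touches at most two entries, so the difference only matters for nodes with many children.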

On Thu, Nov 7, 2013 at 1:53 PM, adamrp notifications@github.com wrote:

Thanks Daniel, I think this looks really good. Minor comments mostly. Not sure how best to get around that list lookup, although I do think it could end up being a time-sink for very large trees.


wasade commented 11 years ago

@ElDeveloper @adamrp safe to merge?

ElDeveloper commented 11 years ago

:+1:

wasade commented 11 years ago

thanks

On Fri, Nov 8, 2013 at 11:09 AM, Yoshiki Vázquez Baeza <notifications@github.com> wrote:

:+1:
