biocore / microsetta-public-api

A public microservice to support The Microsetta Initiative
BSD 3-Clause "New" or "Revised" License

Taxonomy bp tree to genus tree #96

Closed gwarmstrong closed 2 years ago

gwarmstrong commented 3 years ago

This indexes a genus-level taxonomy tree with propagated labels in bp format. The change affects the Empress endpoint, which now serves a smaller tree.

There is a function, create_tree_node_from_lineages, which is similar to skbio.TreeNode.from_taxonomy but is preferred here because TreeNode.from_taxonomy has some undesirable behavior for this application, e.g., the repeated c__c node:

>>> from skbio.tree import TreeNode
>>> lineages = [
...     ('c__c', ['k__a', 'p__b']),
...     ('o__d', ['k__a', 'p__b', 'c__c']),
...     ('o__h', ['k__a', 'p__f', 'c__g']),
... ]
>>> tree = TreeNode.from_taxonomy(lineages)
>>> print(tree.ascii_art())
                              /-c__c
                    /p__b----|
--------- /k__a----|          \c__c---- /-o__d
                   |
                    \p__f---- /c__g---- /-o__h

wasade commented 3 years ago

Does the following change on input create the same result?

In [13]: lineages = [
    ...:     ('k__a; p__b; c__c', ['k__a', 'p__b', 'c__c']),
    ...:     ('k__a; p__b; c__c; o__d', ['k__a', 'p__b', 'c__c', 'o__d']),
    ...:     ('k__a; p__f; c__g; o__h', ['k__a', 'p__f', 'c__g', 'o__h']),
    ...: ]

In [14]: print(skbio.TreeNode.from_taxonomy(lineages).ascii_art())
                                        /-k__a; p__b; c__c
                    /p__b---- /c__c----|
--------- /k__a----|                    \o__d---- /-k__a; p__b; c__c; o__d
                   |
                    \p__f---- /c__g---- /o__h---- /-k__a; p__f; c__g; o__h

gwarmstrong commented 3 years ago

Not exactly. The tree that create_tree_node_from_lineages outputs will look more like this:


                    /k__a; p__b---- /k__a; p__b; c__c---- /-k__a; p__b; c__c; o__d
--------- /k__a----|  
                   |
                    \k__a; p__f---- /k__a; p__f; c__g---- /-k__a; p__f; c__g; o__h

So there is no redundancy in the nodes and the same depth in the tree will always refer to the same taxonomic level.
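
A minimal sketch of the naming scheme described above, in plain Python. It is not the PR's create_tree_node_from_lineages; the helper name build_prefix_tree and the child-to-parent dict representation are assumptions made for illustration only. Each node is named by its full lineage prefix, so shared prefixes merge into one node and depth always matches taxonomic rank.

```python
def build_prefix_tree(lineages, sep="; "):
    """Return {node_name: parent_name} with one node per lineage prefix.

    Hypothetical helper for illustration; not the function from this PR.
    A parent of None marks a child of the root.
    """
    parent_of = {}
    for lineage in lineages:
        parent = None
        for depth in range(1, len(lineage) + 1):
            # Propagate the lineage: the node name is the whole prefix.
            name = sep.join(lineage[:depth])
            # setdefault keeps a single node per prefix, so repeats merge.
            parent_of.setdefault(name, parent)
            parent = name
    return parent_of


lineages = [
    ["k__a", "p__b", "c__c"],
    ["k__a", "p__b", "c__c", "o__d"],
    ["k__a", "p__f", "c__g", "o__h"],
]
tree = build_prefix_tree(lineages)

# Tips are the nodes that never appear as a parent.
tips = sorted(set(tree) - set(tree.values()))
for tip in tips:
    print(tip)
```

With the three lineages above, the only tips are the two order-level names; "k__a; p__b; c__c" exists once, as an internal node.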

wasade commented 3 years ago

It omits a tip for k__a; p__b; c__c?

gwarmstrong commented 3 years ago

Yes. It also propagates all lineage names.

wasade commented 3 years ago

How do features w/o full lineage information map to the tree?

gwarmstrong commented 3 years ago

Can you describe a little more or give an example of "features w/o full lineage information"?

wasade commented 3 years ago

A feature may be classified to order (e.g., "k__a; p__b; c__c; o__d") and another feature in the same dataset may only be classified to phylum (e.g., "k__a; p__b"). I'm unsure how the representation here handles this. I think from_taxonomy does what is needed here. In the original example, there looks to be a conflation of tip and internal nodes, which I think stems from the feature ID being replaced by a taxon. In the example below, names aren't replicated:

In [8]: lineages = [
   ...:     ('A', ['k__a', 'p__b', 'c__c']),
   ...:     ('B', ['k__a', 'p__b', 'c__c', 'o__d']),
   ...:     ('C', ['k__a', 'p__f', 'c__g', 'o__h']),
   ...: ]

In [9]: print(skbio.TreeNode.from_taxonomy(lineages).ascii_art())
                                        /-A
                    /p__b---- /c__c----|
--------- /k__a----|                    \o__d---- /-B
                   |
                    \p__f---- /c__g---- /o__h---- /-C

gwarmstrong commented 3 years ago

Okay yeah, so create_tree_node_from_lineages would map a feature without full lineage information to an internal node in the tree. I thought this was desirable since those features would refer to internal nodes in the taxonomy. What is the utility of artificially thinking of them as tips?
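
To make that mapping concrete, here is a rough sketch, assuming each feature is attached to the node named by its joined lineage. The names features, feature_to_node, and is_internal are hypothetical and only illustrate the idea; they are not taken from the PR.

```python
SEP = "; "

# Feature IDs and lineages from the example earlier in the thread.
features = {
    "A": ["k__a", "p__b", "c__c"],
    "B": ["k__a", "p__b", "c__c", "o__d"],
    "C": ["k__a", "p__f", "c__g", "o__h"],
}

# Assumption: a feature maps to the node named by its propagated lineage.
feature_to_node = {fid: SEP.join(lin) for fid, lin in features.items()}
node_names = set(feature_to_node.values())


def is_internal(name):
    """True if another feature's lineage extends this one."""
    return any(other.startswith(name + SEP) for other in node_names)


internal_features = sorted(
    fid for fid, node in feature_to_node.items() if is_internal(node)
)
print(internal_features)
```

Here only A lands on an internal node, because B's lineage extends it to order; B and C land on tips.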

The other case that create_tree_node_from_lineages handles is that Empress needs each node to have a unique name, so we need to be careful about cases like:

>>> lineages = [
...     ('A', ['k__a', 'p__b', 'c__c']),
...     ('A', ['k__a', 'p__b', 'c__c']),
...     ('B', ['k__a', 'p__b', 'c__c', 'o__d']),
...     ('C', ['k__a', 'p__f', 'c__g', 'o__h']),
... ]

because IIRC TreeNode.from_taxonomy will create two nodes for A.
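
One way to guard against repeated entries, whatever the tree builder does with them, is to collapse duplicates before constructing the tree. The sketch below is an assumption about how that could look, not the PR's code; dedupe_lineages is a hypothetical name, and raising on conflicting lineages is a design choice made here for illustration.

```python
def dedupe_lineages(lineages):
    """Collapse repeated (feature_id, lineage) pairs, keeping first-seen order.

    Hypothetical helper for illustration. Raises ValueError if one ID
    appears with two different lineages, since the node would be ambiguous.
    """
    seen = {}
    for feature_id, lineage in lineages:
        key = tuple(lineage)
        if feature_id in seen and seen[feature_id] != key:
            raise ValueError(f"conflicting lineages for {feature_id!r}")
        seen.setdefault(feature_id, key)
    # dicts preserve insertion order, so the first occurrence wins.
    return [(fid, list(lin)) for fid, lin in seen.items()]


lineages = [
    ("A", ["k__a", "p__b", "c__c"]),
    ("A", ["k__a", "p__b", "c__c"]),
    ("B", ["k__a", "p__b", "c__c", "o__d"]),
    ("C", ["k__a", "p__f", "c__g", "o__h"]),
]
unique = dedupe_lineages(lineages)
print([fid for fid, _ in unique])
```

The deduplicated list has one entry per ID, so any builder that creates one node per input row produces uniquely named nodes for these features.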