gwarmstrong closed 2 years ago
Does the following change on input create the same result?
In [13]: >>> lineages = [
...: ... ('k__a; p__b; c__c', ['k__a', 'p__b', 'c__c']),
...: ... ('k__a; p__b; c__c; o__d', ['k__a', 'p__b', 'c__c', 'o__d']),
...: ... ('k__a; p__f; c__g; o__h', ['k__a', 'p__f', 'c__g', 'o__h'])
...: ... ]
In [14]: print(skbio.TreeNode.from_taxonomy(lineages).ascii_art())
                                        /-k__a; p__b; c__c
                    /p__b---- /c__c----|
--------- /k__a----|                    \o__d---- /-k__a; p__b; c__c; o__d
                   |
                    \p__f---- /c__g---- /o__h---- /-k__a; p__f; c__g; o__h
Not exactly. The tree that create_tree_node_from_lineages outputs will look more like this:
                    /k__a; p__b---- /k__a; p__b; c__c---- /-k__a; p__b; c__c; o__d
--------- /k__a----|
                   |
                    \k__a; p__f---- /k__a; p__f; c__g---- /-k__a; p__f; c__g; o__h
So there is no redundancy in the nodes and the same depth in the tree will always refer to the same taxonomic level.
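To make the propagated naming concrete, here is a minimal sketch of the idea. It is not the implementation in this PR: `build_propagated_tree` and `depth_of` are hypothetical helpers, and a nested dict stands in for the real tree type. Each node is named by its full lineage prefix, so a given depth always corresponds to one taxonomic level.

```python
def build_propagated_tree(lineages):
    """Build a nested dict whose keys are propagated lineage names.

    `lineages` is an iterable of lists of taxon labels, e.g.
    ['k__a', 'p__b', 'c__c']. Hypothetical helper, for illustration only.
    """
    root = {}
    for taxa in lineages:
        children = root
        for depth in range(1, len(taxa) + 1):
            # The node name is the whole lineage prefix down to this level.
            name = "; ".join(taxa[:depth])
            # setdefault reuses an existing node, so a shared prefix
            # never produces a second node with the same name.
            children = children.setdefault(name, {})
    return root


def depth_of(tree, target, depth=1):
    """Return the depth of the node named `target`, or None if absent."""
    for name, children in tree.items():
        if name == target:
            return depth
        found = depth_of(children, target, depth + 1)
        if found is not None:
            return found
    return None


tree = build_propagated_tree([
    ['k__a', 'p__b', 'c__c'],
    ['k__a', 'p__b', 'c__c', 'o__d'],
    ['k__a', 'p__f', 'c__g', 'o__h'],
])
```

In this sketch the class-level lineage `k__a; p__b; c__c` is an internal node at depth 3 whose only child is `k__a; p__b; c__c; o__d`, rather than a separate tip.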
It omits a tip for k__a; p__b; c__c?
Yes. It also propagates all lineage names.
How do features w/o full lineage information map to the tree?
Can you describe a little more or give an example of "features w/o full lineage information"?
A feature may be classified to order (e.g., "k__a; p__b; c__c; o__d") and another feature in the same dataset may only be classified to phylum (e.g., "k__a; p__b"). I'm unsure how the representation here handles this. I think from_taxonomy does what is needed here. In the original example, there looks to be a conflation of tip and internal nodes, which I think stems from the feature ID being replaced by a taxon. In the example below, names aren't replicated:
In [8]: lineages = [
...: ... ('A', ['k__a', 'p__b', 'c__c']),
...: ... ('B', ['k__a', 'p__b', 'c__c', 'o__d']),
...: ... ('C', ['k__a', 'p__f', 'c__g', 'o__h'])]
...: ...
In [9]: print(skbio.TreeNode.from_taxonomy(lineages).ascii_art())
                                        /-A
                    /p__b---- /c__c----|
--------- /k__a----|                    \o__d---- /-B
                   |
                    \p__f---- /c__g---- /o__h---- /-C
Okay yeah, so create_tree_node_from_lineages would map a feature without full lineage information to an internal node in the tree. I thought this was desirable since those features would refer to internal nodes in the taxonomy. What is the utility of artificially thinking of them as tips?
The other case that create_tree_node_from_lineages handles is that Empress needs each node to have a unique name, so we need to be careful about cases like:
>>> lineages = [
... ('A', ['k__a', 'p__b', 'c__c']),
... ('A', ['k__a', 'p__b', 'c__c']),
... ('B', ['k__a', 'p__b', 'c__c', 'o__d']),
... ('C', ['k__a', 'p__f', 'c__g', 'o__h'])]
because IIRC TreeNode.from_taxonomy will create two nodes for A.
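As a rough illustration of why keying nodes on the propagated lineage avoids this, here is a small sketch. `group_features_by_node` is a hypothetical helper, not code from this PR. It assumes the node name is the full "; "-joined lineage, so a repeated (feature, lineage) pair resolves to the same node instead of adding a second one.

```python
from collections import defaultdict


def group_features_by_node(lineages):
    """Map each propagated node name to the feature IDs assigned to it.

    Hypothetical helper, for illustration only.
    """
    node_to_features = defaultdict(set)
    for feature_id, taxa in lineages:
        # The full propagated lineage identifies exactly one node,
        # so repeated entries land on the same key.
        node_to_features["; ".join(taxa)].add(feature_id)
    return dict(node_to_features)


lineages = [
    ('A', ['k__a', 'p__b', 'c__c']),
    ('A', ['k__a', 'p__b', 'c__c']),
    ('B', ['k__a', 'p__b', 'c__c', 'o__d']),
    ('C', ['k__a', 'p__f', 'c__g', 'o__h']),
]
nodes = group_features_by_node(lineages)
```

Here the four input rows yield three distinct node names, and the duplicated `A` row maps to a single node.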
This indexes a genus-level taxonomy tree with propagated labels in bp format. This change affects the empress endpoint, giving a smaller tree.
There is a function, create_tree_node_from_lineages, which is similar to skbio.TreeNode.from_taxonomy but is preferred because TreeNode.from_taxonomy has some undesirable behavior for this application, e.g., the repeated c__c node: