biocore / gneiss

compositional data analysis toolbox
https://biocore.github.io/gneiss/
BSD 3-Clause "New" or "Revised" License
55 stars 28 forks source link

[Question] How to correctly use ILR with balance tree basis? #262

Open mortonjt opened 6 years ago

mortonjt commented 6 years ago

This was originally asked here https://github.com/biocore/scikit-bio/issues/1582

There are two details that confuse me regarding using ilr and a custom basis argument.

I understand that gneiss is used for the balance_basis and that m-1 attributes/coumns are returned by ilr which are supposed to represent the balances of the nodes in the tree.

**My first question, how can I know which column goes to which node in my dendrogram?

My second question, how does the ilr function know the labels/order of the nodes in the tree that was provided and does my order for the mat input matter at all?**

Lastly, (bonus) is the implementation below equivalent to PhILR?

My df_otus is a pd.DataFrame that I filtered. The newick tree was created by clustalo when I did a multiple sequence alignment of the Otu centroid sequences from uparse.

import ete3, skbio
from gneiss.balances import balance_basis
import pandas as pd
from io import StringIO

# Otu table
df_otus = df_counts_f1.copy() # filter Otu counts where n=64 samples, m=883 otus
# Path to the newick file generated from 16S sequences in clustalo
path_to_newick_from_clustalo = "./Data/otu_clustering/clustalo_output/otus.qfilter.EE0.15.guidetree"
# Convert ete3 tree
t = ete3.Tree(newick=path_to_newick_from_clustalo)
# Prune tree
t.prune(df_otus.columns)
# Get newick string for pruned tree
newick = t.write(outfile=None)
# Convert to skbio Tree object
t_skbio = skbio.TreeNode.read(StringIO(newick))
# Gneiss
basis, nodes = balance_basis(t_skbio)
# ILR
df_ilr = pd.DataFrame(ilr(df_otus+1, basis=basis), index=df_otus.index)
mortonjt commented 6 years ago

@jolespin you are asking all of the right questions -- getting the ids aligned is probably the most annoying part wrt using the ilr transform

The most important thing to realize is that the ilr transform will transform proportions at the tips of the tree into log-ratios in the ancestral nodes of the tree.

There are two ways you can link back the balances to the OTUs (assuming that all of your internal nodes are labeled - everything breaks if you have None labels in your internal nodes).

  1. Visualize the tree and decorate the internal nodes / tips with different statistics. This is currently done in radialplot - an examples can be found below https://biocore.github.io/gneiss/docs/v0.4.0/tutorials/python/cfstudy-python-tutorial.html https://biocore.github.io/gneiss/docs/v0.4.0/tutorials/python/cfs-python-tutorial.html

  2. Extract the OTUs manually from the OTUs, if you are interested in a specific internal node (i.e. y1 if you labeled the tree via gneiss.util.rename_internal_nodes), you can look up the OTUs as follows

from gneiss.util import NUMERATOR, DENOMINATOR
num_otus = t_skbio.find('y1').children[NUMERATOR].subset()
denom_otus = t_skbio.find('y1').children[DENOMINATOR].subset()

Not great, some more thought needs to done to figure out how to clean up the querying process

So back to your original questions

  1. Yes, if you have labeled your internal nodes in the tree, you can map these names back to the columns resulting from the ILR transform (I would look at the ilr_function for a more seamless way to do this.

  2. The ordering in the dataframe doesn't matter as long as the labels are in place

  3. This isn't exactly like philr, since philr is using a fancy weighting scheme to incorporate the branch lengths into the actual ilr transform. Possibly @jsilve24 can weigh in on this.

jolespin commented 6 years ago

@mortonjt Thanks a lot for getting back to me so quickly with such a detailed response! I looked into it and your explanation plus links were extremely helpful. I actually understand what is happening after looking at the ilr_transform function since I was a little confused after skimming the paper. By the way, great coding style with that function. Very easy to interpret, very clean, and I haven't seen the type suggestions used in practice yet.

One thing I wanted to point out was the ete3_to_skbio function below. I stumbled on your question and tweaked it a little bit so it relabels only the non-leaf nodes. Is this the correct way?

I am close to understanding the NUMERATOR and DENOMINATOR section but it seems like since we are dealing with log-ratios, the NUMERATOR and DENOMINATOR sets correspond to which organisms were on top of the fraction during the calculation so one can go back to their respective y* and look at the raw values? Would those values be based off the mean proportions of the left subtree vs. the mean of the right subtree? I am getting an empty set for my denominator Otus.

from gneiss.util import NUMERATOR, DENOMINATOR
num_otus = t_skbio.find('y1').children[NUMERATOR].subset()
denom_otus = t_skbio.find('y1').children[DENOMINATOR].subset()
len(num_otus), len(denom_otus)
# (882, 0)
from gneiss.balances import balance_basis
from gneiss.composition import ilr_transform
# Otu table
df_otus = df_counts_f1.copy() # filter Otu counts where n=64 samples, m=883 otus
# Path to the newick file generated from 16S sequences in clustalo
path_to_newick_from_clustalo = "./Data/otu_clustering/clustalo_output/otus.qfilter.EE0.15.guidetree"
# Convert ete3 tree
t = ete3.Tree(newick=path_to_newick_from_clustalo)
# Prune tree
t.prune(df_otus.columns)
# Get newick string for pruned tree
def ete3_to_skbio(tree):
    intermediate_node_index = 1
    for node in tree.traverse():
        if not node.is_leaf():
            node.name = f"y{intermediate_node_index}"
            intermediate_node_index += 1
    return skbio.TreeNode.read(StringIO(tree.write(format=1, format_root_node=True)))
t_skbio = ete3_to_skbio(t)
# Gneiss
basis, nodes = balance_basis(t_skbio)
# ILR
df_ilr = ilr_transform(df_otus+1, t_skbio)
mortonjt commented 6 years ago

Its possible that the denominator is a tip, which will return an empty set when using subset (a really annoying edge case). In your case, it would be something like

num_otus = t_skbio.find('y1').children[NUMERATOR].subset()
denom_otus = t_skbio.find('y1').children[DENOMINATOR].name

Definitely not a very friendly way to interpret your balances - an easier way would be to use the balance-taxonomy command in gneiss (see gneiss tutorial)

jolespin commented 6 years ago

Ah man I didn't even think about that edge case. Thanks for catching that. I've looked into the tutorial and it's helpful to see the applications. Right now I'm using Python 3.6 and many of my startup scripts use the f"string" formatting (not compatible with Python 3.5) so I can't use qiime in my current environment.

Is there any source code you can direct me to on github that shows the commands used in qiime gneiss balance-taxonomy? I checked both https://github.com/qiime2/qiime2/tree/master/qiime2 and https://github.com/qiime2/q2-gneiss to no avail.

Also, I created a helper script that evades the edgecase:

# Numerator and denominator of balance trees
def balance_ratio_attributes(tree:skbio.TreeNode, node_query:str="y1") -> pd.Series:
    """
    Get numerator and denominator of nodes
    """
    def get_node_subset(query, tree, node_query):
        attributes = tree.find(node_query).children[query].subset()
        if len(attributes) > 0:
            return attributes
        else:
            return frozenset([tree.find(node_query).children[query].name])

    num_attributes = get_node_subset(NUMERATOR, tree=tree, node_query=node_query)
    denom_attributes = get_node_subset(DENOMINATOR, tree=tree, node_query=node_query)
    idx_leaves = [*map(lambda node:node.name, tree.tips())]
    return pd.Series({
        **dict(zip(num_attributes,
                 len(num_attributes)*["numerator"])),
        **dict(zip(denom_attributes,
         len(denom_attributes)*["denominator"]))
    },name=node_query)[idx_leaves].dropna()
mortonjt commented 6 years ago

@jolespin couple comments here

  1. The balance-taxonomy command can be found here most of it is formatting for the visualization

  2. conda can help with this usecase, since you can conda install python=3.5. Qiime2 should just work on basically any modern system (provided you create a new conda environment for it).