This is the source code for the paper "Codiscovering graphical structure and functional relationships within data: A Gaussian Process framework for connecting the dots".
Please see the companion blog post for a gentle introduction to the method and the code.
The code is written in Python 3 and requires the following packages:
The code is available on PyPI and can be installed using pip:
pip install ComputationalHypergraphDiscovery
After cloning this repo, you install the required packages using pip:
pip install -r requirements.txt
and add the repo to your PYTHONPATH:
git clone https://github.com/TheoBourdais/ComputationalHypergraphDiscovery.git
export PYTHONPATH=$PYTHONPATH:/path/to/ComputationalHypergraphDiscovery/src
Graph discovery takes very little time. The following code runs the method on the example dataset provided in the repo. The dataset is a 2D array of shape (n_samples, n_features) where each row is a sample and each column is a feature. After fitting the model, the graph is stored in the GraphDiscovery
object, specifically its graph G
attribute. The graph is a networkx
object, which can be easily plotted using .plot_graph()
.
You can find the Sachs dataset in the repo, at this link.
import ComputationalHypergraphDiscovery as CHD import pandas as pd df=pd.read_csv('https://github.com/TheoBourdais/ComputationalHypergraphDiscovery/blob/8c3fed6dfe58a01cb73e469579ff7703b0681f7e/examples/Sachs/SachsData.csv?raw=true') df=df.sample(n=500,random_state=1) #subsample to run example quickly kernels=[CHD.Modes.LinearMode(),CHD.Modes.QuadraticMode()] graph_discovery = CHD.GraphDiscovery.from_dataframe(df,kernels=kernels) graph_discovery.fit() graph_discovery.plot_graph()
You should obtain the following graph (note that it will slightly differ from the graph in the paper due to the subsampling):
The code gives an easy-to-use interface to manipulate the graph discovery method. It is designed to be modular and flexible. The main changes you can make are
In order to initiate the graph discovery method, you need to create a GraphDiscovery
object. The main parameters are:
X
: The data used to fit the model. It is a 2D array of shape (n_features,n_samples) where each row is a feature and each column is a sample.node_names
: The names of the nodes. It is a list of strings of length n_features.kernels
: The kernels used to link the nodes. If none is provided, a default kernel is used. See the section on manipulating kernels for more details.cluster
and possible_edges
: additional features to inform the graph discovery process and refine the recovery. See the section on manipulating clusters and possible edges for more details.verbose
: to choose whether to print information during the fitting process.Here is an example:
import ComputationalHypergraphDiscovery as CHD
import numpy as np
X=np.random.rand(10,100)
node_names=[f'node_{i}' for i in range(10)]
graph_discovery = CHD.GraphDiscovery(X,node_names)
Note: This shows how to create a
GraphDiscovery
object from a numpy array. If you want to follow this tutorial, use the Sachs dataset provided in the repo as a reference. You can use the code below to load it:
If you have a Pandas dataframe, you can use the from_dataframe
method (see the method's docstring for more details):
import ComputationalHypergraphDiscovery as CHD
import pandas as pd
df=pd.read_csv('https://github.com/TheoBourdais/ComputationalHypergraphDiscovery/blob/8c3fed6dfe58a01cb73e469579ff7703b0681f7e/examples/Sachs/SachsData.csv?raw=true')
df=df.sample(n=500,random_state=1) #subsample to run example quickly
graph_discovery = CHD.GraphDiscovery.from_dataframe(df)
Once the object is created, you can fit the model using the .fit()
method. The fit method takes the following parameters:
targets
: Which nodes you wish to discover the ancestors of. By default, we recover the ancestors of all nodes. kernel_chooser
and mode_chooser
: Decision logic to refine the graph discovery process. See the section on manipulating decision logics for more details.gamma
: The noise parameter of the kernel. By default, it is automatically estimated. We advise against setting it manually, as a good choice of gamma is crucial for the performance of the algorithm and is unintuitive to find.gamma_min
: The minimum value of gamma when finding it automatically. By default, it is 1e-9
. This is necessary for numerical stability.Here is an example:
graph_discovery.fit()
If you wish to discover the ancestors of a specific node, you can do so by specifying the targets
parameter:
graph_discovery.fit(targets=['$Raf$'])
Once the model is fitted, you can access all the results of the graph discovery using the .G
attribute. It is a networkx
object, and you can use any of the methods provided by networkx
. For example, you can plot the graph using the .plot_graph()
method:
graph_discovery.plot_graph()
The
.plot_graph()
method allows for some customization of the resulting plot. See the method's docstring for more details on available parameters.Disclaimer: Note that the data is assumed to be real numbers. The algorithm only accepts data in the form of a 2D array of shape (n_samples,n_features). Other shapes will be rejected, and other types of data will be treated as real numbers.
The kernel is the function used to define the type of functions that will link the nodes. The code provides a set of kernels, but you can easily add your own. The interface is designed to ressemble the scikit-learn API, and you can use any kernel from scikit-learn.
Definitions: a kernel is a function $k$ such that $k(x,y)=\langle \phi(x),\phi(y)\rangle$ for some feature map $\phi$. it allows the definition of a similarity measure between two points and enhances the expressivity of the model. We use kernels extensively in this project and will define some terms here:
- Kernel: a function $k$ used to perform the downstream ML task (here, identify the ancestors).
- Kernel Matrix: Given a kernel $k$ and a dataset $X$ of shape (n_features,n_samples), the kernel matrix is a matrix of shape (n_samples,nsamples) defined by $K{i,j}=k(X_i, X_j)$.
- Kernel Mode: A kernel mode is a kernel itself that is used to form a sum kernel (if $k=k_1+k_2$, $k_1$ and $k_2$ are kernel modes of $k$)
- Kernel Mode Matrix: Given kernel modes $k_1,..,k_p$ of $k$ ($k=k_1+..+k_p$) and a dataset $X$ of shape (n_features,n_samples), the kernel mode matrix is a matrix of shape (p,n_samples,nsamples) defined by $K{l, i,j}=k_l(X_i, X_j)$ where $kl$ is the ith kernel mode of $k$. We have $K{i,j}=k(X_i,Xj)=\sum{l=1}^p K_{l,i,j}$.
To manipulate kernels, import the modes module:
import ComputationalHypergraphDiscovery.Modes as Modes
The ModeKernel
interface defines our kernel modes. An example of a kernel mode is the LinearMode
class, which implements the linear kernel that can be used as a mode. Currently, we only provide the 3 kernels used in the paper, applied to $x$ and $y$ that are rows of our dataset:
Modes.LinearMode()
Modes.QuadraticMode()
l
the lengthscaleModes.GaussianMode(l=1)
Interpolatory:
A kernel $k(x,y)=\langle\phi(x),\phi(y)\rangle$ will be interpolatory if its associated feature map $\phi$ has more dimensions than the number of data points. For example, the Gaussian kernel is interpolatory (its feature map is infinite-dimensional), while the linear kernel will probably not be (if the input is of dimension d, the number of data points is n, if n>d the kernel is not interpolatory).
Caution: We have defined here a kernel with 3 modes: one linear, one quadratic, and one Gaussian. However, for the overall algorithm to work, we prefer to define the 3 following hierarchical modes:
- Linear mode
- Linear + quadratic mode
- Linear + quadratic + Gaussian mode
This allows us to have modes of increasing complexity so that when we change mode, the new one is strictly more expressive than the previous one (this means we widen the search as we progress)
In order to identify the edges of the graph, we need to decide whether certain connections are significant. The code provides indicators (like the level of noise), and the user specifies how to interpret them. The functions used to make these decisions are available in the decision
module.
import ComputationalHypergraphDiscovery.decision as decision
There are two leading indicators that we use to make decisions (See the paper or the blog post for more details):
There are three types of decisions to be made in the algorithm:
The first decision to make is to decide if a node has ancestors or not, as well as the kernel with which the node has ancestors. To do so, we compute the signal-to-noise ratio and the Z_test
for each kernel. We get the following performances:
An example of output is the following dictionary (some keys have been hidden for clarity):
kernel_performances={
'linear': {
'noise-to-signal ratio': 0.45,
'Z_test': [0.98, 1.0]
},
'quadratic': {#see note above, this is not the quadratic kernel but linear+quadratic
'noise-to-signal ratio': 0.91,
'Z_test': [0.96, 1.0]
},
'gaussian': {#same comment as above
'noise-to-signal ratio': 0.33,
'Z_test': [0.74, 1.0]
}
}
Several decisions can be made using this dictionary. Decisions are made by instances of the KernelChooser
class.
ThresholdKernelChooser
. MinNoiseKernelChooser
.Z_test
interval (which would mean the noise ratio is not statistically significant).Once we have chosen a kernel, we need to select the ancestors of the node. To do so, we remove the ancestors one by one and compute the signal-to-noise ratio and the Z_test
after each removal. The evolution of the signal-to-noise ratio and the Z_test
gives the following graph:
As we remove ancestors, it becomes harder to explain the node from the remaining ancestors. Seeing the evolution of the signal-to-noise ratio and the Z_test
, we must choose how many ancestors we wish to keep.
Note: One good indicator of when we have removed an essential ancestor is when we observe a spike in the noise ratio. This is why increments in the noise ratio are plotted on the right, as it is a good indicator.
Several decisions can be made from this evolution. By default, we use MaxIncrementModeChooser
. Decisions are made by instances of the ModeChooser
class:
ThresholdModeChooser
. MaxIncrementModeChooser
. If a set of nodes is highly dependent, it is possible to merge them into a cluster of nodes. This gives greater readability and prevents the graph discovery method from missing other connections. It also lets you perform graph discovery in a hierarchical fashion:
An example of the use of clusters is given in the Gene notebook
GraphDiscovery
objectWhen creating a GraphDiscovery
object, you can specify the cluster
parameter. It is a list of lists of strings, where each list of strings is a cluster of nodes. For example, with the Sachs dataset, we can define the following clusters:
clusters=[
['$PKC$','$P38$','$Jnk$'],
['$Erk$','$Akt$','$PKA$'],
['$Raf$','$Mek$'],
['$Plcg$','$PIP2$','$PIP3$']
]
graph_discovery = CHD.GraphDiscovery.from_dataframe(df,clusters=clusters)
GraphDiscovery
object, for the second runIf you already have a GraphDiscovery
object named graph_discovery
on which you have run the algorithm, you can choose the clusters based on the results and apply them to get a new GraphDiscovery
object:
graph_discovery2=graph_discovery.prepare_new_graph_with_clusters(clusters)
Once the clusters have been defined, you can use the GraphDiscovery
object as usual. It will handle the clustering you specified and will plot the graph showing clusters.
We advise using clusters in a multi-level fashion, as is demonstrated here.
import numpy as np
import ComputationalHypergraphDiscovery as CHD
from ComputationalHypergraphDiscovery.Modes import *
# Load the data
data = np.loadtxt('data/Sachs.txt', delimiter='\t')
# Normalize the data
data = (data - np.mean(data, axis=0)) / np.std(data, axis=0)
node_names=np.array(['$Raf$','$Mek$','$Plcg$','$PIP2$','$PIP3$','$Erk$','$Akt$','$PKA$','$PKC$','$P38$','$Jnk$'])
# Define the kernel
kernel = [0.1*LinearMode(), 0.01*QuadraticMode()]
# Set up CHD
graph_discovery = CHD.GraphDiscovery(data,node_names,kernel)
# Perform CHD
graph_discovery.fit()
graph_discovery.plot_graph()
# Refine the graph with clusters
clusters=[
['$PKC$','$P38$','$Jnk$'],
['$Erk$','$Akt$','$PKA$'],
['$Raf$','$Mek$'],
['$Plcg$','$PIP2$','$PIP3$']
]
graph_discovery2=graph_discovery.prepare_new_graph_with_clusters(clusters)
graph_discovery2.fit()
graph_discovery2.plot_graph()
This code recovers the following graph:
As a practitioner, you may have some knowledge about the data that you wish to use to inform the graph discovery process. For example, you may know that specific nodes cannot be connected. You can specify this information to the algorithm using the possible_edges
parameter of the GraphDiscovery
object. This is a directed graph in the form of a networkx.DiGraph
object. Graph discovery will ignore all connections that are not in the possible_edges
graph. An example of use is given in the chemistry notebook.