# Higher-order interactions toolbox
This repository provides a unified Python-based toolbox to compute higher-order interactions with metrics such as O-information and dO-information. These measures scale well with the number of time series involved, in contrast to most other metrics proposed so far for distinguishing higher-order features in data.
The functioning of complex systems (e.g. the brain, among many others) depends on the interaction between different units; crucially, the resulting dynamics differ from the sum of the dynamics of the parts. What allows these systems to be more than the sum of their parts lies not in the nature of the parts, but in the structure of their interdependencies. To deepen our understanding of these systems, we need to make sense of these interdependencies. Several tools and frameworks have been developed to examine the statistical dependencies within multivariate datasets. Among these, information theory offers a powerful and versatile framework; notably, it allows detecting higher-order interactions that determine the joint informational role of a group of variables.
These interactions amongst variables can be either synergistic or redundant. O-information and dO-information are scalable metrics capable of characterising synergy- and redundancy-dominated systems; their computational complexity scales gracefully with system size, making them suitable for practical data analysis.
Concepts like redundancy and synergy can be understood intuitively, but to understand them in greater detail (through mathematical expressions) one may first need to become familiar with negentropy, collective constraints and shared randomness. For more details, refer to Section II (Fundamentals) of the paper which introduced O-information. That section walks through how the total information of a system described by a random vector is divided, for a given state of knowledge (i.e. a probability distribution), into what is determined by the constraints and what is not instantiated until an actual measurement takes place. Moreover, both terms can be further decomposed into their individual and collective components, yielding different perspectives on interdependency seen as either collective constraints or shared randomness.
In short, if the interdependencies are more efficiently explained as shared randomness, we call the system redundancy-dominated; if they are more efficiently explained as collective constraints, we call it synergy-dominated. As we will see, this can be read off from whether the O-information or dO-information of the system is greater than zero (redundancy-dominated) or less than zero (synergy-dominated).
The O-information (shorthand for "information about Organisational structure") of a system described by a vector $X$ of $n$ random variables (with a minor change in notation, dropping the superscript $n$ for convenience) is defined as

$$\Omega(X) = (n-2)\,H(X) + \sum_{j=1}^{n} \left[ H(X_j) - H(X_{-j}) \right]$$

where $H$ stands for the entropy and $X_{-j}$ is the complement of $X_j$ with respect to $X$. This equation is implemented here in the code (just to help link with the expressions in the paper better).
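To make the formula concrete, here is a minimal sketch that estimates the O-information under a joint-Gaussian assumption, computing every entropy term directly from the covariance matrix. The function names are illustrative and not part of the toolbox's API; the toolbox's own estimators live in `./toolbox/gcmi.py` and `./toolbox/lin_est.py`.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def o_information_gaussian(X):
    """O-information of the rows of X (shape: n variables x T timepoints),
    estimated under a joint-Gaussian assumption."""
    n = X.shape[0]
    cov = np.cov(X)
    omega = (n - 2) * gaussian_entropy(cov)                # (n-2) H(X)
    for j in range(n):
        rest = [i for i in range(n) if i != j]
        omega += gaussian_entropy(cov[np.ix_([j], [j])])   # + H(X_j)
        omega -= gaussian_entropy(cov[np.ix_(rest, rest)]) # - H(X_{-j})
    return omega

# A positive value indicates redundancy-dominance, a negative one synergy-dominance.
```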
Now, if we add a stochastic variable $Y$ to the set of $X$ variables, extend the measure to capture the character of the information flow from the $X$ variables to the target $Y$, and further condition on the state vector of the target variable in order to remove shared information due to common history and input signals, we arrive at the dynamic O-information, or dO-information, from the group of variables $X$ to the target series $Y$, defined as

$$d\Omega(Y; X) = (1-n)\, I(Y; X \mid Y_0) + \sum_{j=1}^{n} I(Y; X_{-j} \mid Y_0)$$

where $I$ stands for the mutual information, $Y_0$ denotes the past state vector of the target, and the state vectors are of order $m$. This equation is implemented here in the code (again just to help link with the expressions in the paper better).
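Along the same lines, here is a minimal sketch of the dO-information under a joint-Gaussian assumption, building the lagged state vectors explicitly and expanding each conditional mutual information into entropies. All names are illustrative, not the toolbox's API; the actual implementation is in `./toolbox/dOinfo.py`.

```python
import numpy as np

def _h(cov, idx):
    """Gaussian entropy of the sub-vector picked out by the index list `idx`."""
    sub = cov[np.ix_(idx, idx)]
    return 0.5 * (len(idx) * np.log(2 * np.pi * np.e) + np.linalg.slogdet(sub)[1])

def _cmi(cov, a, b, c):
    """Gaussian conditional mutual information I(a; b | c) from one covariance."""
    return _h(cov, a + c) + _h(cov, b + c) - _h(cov, a + b + c) - _h(cov, c)

def do_information_gaussian(y, X, m):
    """dO-information from the n driver series in X (n x T) to target y (length T),
    with state vectors of order m, under a joint-Gaussian assumption."""
    n, T = X.shape
    cols = [y[m:]]                                      # present of the target
    cols += [y[m - k:T - k] for k in range(1, m + 1)]   # Y_0: past of the target
    for j in range(n):                                  # past of each driver
        cols += [X[j, m - k:T - k] for k in range(1, m + 1)]
    cov = np.cov(np.vstack(cols))
    tgt, y0 = [0], list(range(1, m + 1))
    blocks = [list(range(1 + m + j * m, 1 + m + (j + 1) * m)) for j in range(n)]
    all_x = [i for b in blocks for i in b]
    d_omega = (1 - n) * _cmi(cov, tgt, all_x, y0)
    for j in range(n):
        rest = [i for k, b in enumerate(blocks) if k != j for i in b]
        d_omega += _cmi(cov, tgt, rest, y0)
    return d_omega
```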
For further details, readers are urged to read the papers on O-information and dO-information.
The goal of this toolbox is to collect and refine existing metrics of higher-order interactions, currently implemented in Matlab or Java, and integrate them into a unified Python-based toolbox. The main deliverable of the project is a toolbox whose inputs are the measurements of different variables plus some parameters, and whose outputs are the measures of higher-order interdependencies.
The main code for computing Oinfo and dOinfo is in `./toolbox/Oinfo.py` and `./toolbox/dOinfo.py`. The exhaustive loop functions do a grid search over all multiplet sizes less than the provided `maxsize`, compute the outputs and append them to the output dict. The functions for estimating entropy and conditional mutual information are in `./toolbox/gcmi.py` and `./toolbox/lin_est.py`, one file per estimator. The combinations manager class and the loading/saving functions are included in `./toolbox/utils.py`.
Ideally you'd only need to run `main.py`, but if you want to explore and play around you could also try `toy_example.py`, which is quite similar to `main.py` but can be used for prototyping and debugging on a subset of inputs. To check the runtime of the different estimators, use `check_gcmi_timing.py`.
The input is usually a timeseries; currently the `.mat` and `.tsv` formats are supported by the toolbox. The `exhaustive_loop_zerolag` function from `toolbox.Oinfo` and `exhaustive_loop_lagged` from `toolbox.dOinfo` take a timeseries of shape (number of variables, number of timepoints). If running locally, feel free to use either the `.mat` or the `.tsv` input format, but if running on brainlife the `.tsv` format is recommended, as it matches the default timeseries datatype (`.tsv.gz`). Note that you don't need to run `exhaustive_loop_zerolag` or `exhaustive_loop_lagged` separately, as `main.py` does the preprocessing (such as removing columns with all-zero timeseries and generating `data/cleaned_timeseries.tsv` in the case of `.tsv` input) and then passes the result on to these functions. These functions return a dictionary output which is saved in pickle format. Ideally reading or writing to pickle shouldn't give any errors, but if it does, try using `pickle5` instead of `pickle` in `utils.py`. The main outputs are an array of sorted redundancy and synergy values, the indices of the variable combinations that gave those values, and the bootstrap significance of each of those values. The output for dOinfo has an additional nested key denoting the target variable. For more details on how to read the outputs, please see the subsequent sections. The default folder for input timeseries data is `./data` and for outputs `./output`; if the output folder is missing, it will be created.
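As a quick illustration of the expected input shape, here is a hedged sketch of loading a `.tsv` timeseries and transposing it into the (number of variables, number of timepoints) layout the loop functions expect. The file path and the rows-as-timepoints orientation are assumptions for this example.

```python
import pandas as pd

# Hypothetical input: a .tsv whose columns are variables (ROIs) and whose
# rows are timepoints, as in brainlife's timeseries datatype.
df = pd.read_csv("data/timeseries.tsv.gz", sep="\t")
ts = df.to_numpy().T  # -> shape (n_variables, n_timepoints)
print(ts.shape)
```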
To run this toolbox on your input timeseries, ideally you'd only need to run the `main.py` Python file, plus a `config.json` from which it can pick up the necessary parameters to run the code. You only need to fill in the details about the timeseries path, type and parameters/arguments in `config.json`; see the example below:
```json
{
  "input_type": "tsv",
  "input": "data/timeseries.tsv.gz",
  "metric": "dOinfo",
  "higher_order": true,
  "estimator": "gcmi",
  "modelorder": 3,
  "maxsize": 4,
  "n_best": 10,
  "nboot": 100
}
```
The "input_type" argument can either be mat or tsv.
The "input" directs to the path of the timeseries input.
Use the "metric" argument to choose which metric to compute, either "Oinfo" or "dOinfo".
The "higher_order" argument is a boolean (can be either true or false). By setting it true, the code will generate each subsequent combination iteratively and map the index to the variable combination using the combinatorial numbering system, rather than computing all possible combinations before hand and running out of memory. For example, 100 choose 5 is already 75287520 and a practical dataset is bound to have more than 100 ROIs/nodes/timeseries and using higher_order=true will enable to you to go to higher orders (much higher than 5) without running out of memory. But for smaller inputs that won't overwhelm your RAM, users are advised to set it to false as it makes it slightly easier to read the outputs (more details in read_outputs.py
).
The "estimator" argument can be either set to "gcmi" or "lin_est". If set to "gcmi", the code will use gaussian copula approach to compute entropy and mutual information or else if set to "lin_est" then the code will used covariance matrix based linear estimators. In certain scenarios, covariance matrix based estimators might be slightly faster, though cannot guarantee that yet.
The "modelorder" argument is only relevant to dOinfo computation and not Oinfo computation, as it specifies the order of the timeseries/state vectors as in how many past timepoints to consider. Feel free to vary this argument according to your dataset.
The "maxsize" argument determines the max size of the multiplet to which it would do a brute-force/grid search and compute the outputs for all possible multiplet sizes less than maxsize. If you wish to go to higher orders, please increase this argument.
The "n_best" argument helps you decide how many out top redundancy or synergy combinations (sorted in decreasing order of their magnitudes), do you want to include in the outputs.
The "nboot" argument determines how many bootstrap samples to compute theh histogram while determing the significance of the redundancy and synergy values.
If you are running on brainlife, a `config.json` is created for you; if you are running locally, you'll need to create a `config.json` by copying `config-sample.json`. `main.py` also takes an optional single argument with the path to the config (`.json`) file, e.g. `python main.py config.json`; if not provided, it looks for `config.json`. You can either run the toolbox via Docker by simply running `./main`, which sets up the necessary environment and runs `main.py` with `config.json` as the argument, or install the libraries from `requirements.txt` and run `main.py` yourself. You will need to specify the path to the input data, its datatype, and all the arguments for the Oinfo and dOinfo code in `config.json` beforehand.
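A minimal sketch of the config pickup described above, assuming only what the text states (an optional command-line path falling back to `config.json`):

```python
import json
import sys

# main.py-style config pickup: optional path argument, else config.json
config_path = sys.argv[1] if len(sys.argv) > 1 else "config.json"
with open(config_path) as f:
    config = json.load(f)
print(config["metric"], config["estimator"], config["maxsize"])
```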
Please refer to `read_outputs.py` for a detailed walkthrough with examples! It uses some sample outputs already generated and saved in the `outputs` folder.
A sample raw output of the Oinfo code looks something like this:

```python
{'sorted_red': array([0.14653095, 0.13152774, 0.12752084, 0.12326999, 0.09590787,
        0.07990326, 0.07964545, 0.07481491, 0.06944254, 0.06820533]),
 'index_red': array([105, 115, 113,  80, 114,  88,  95, 119,  35,  28], dtype=int64),
 'bootsig_red': array([[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]]),
 'sorted_syn': array([-0.1963071 , -0.16986031, -0.13420828, -0.12996375, -0.10197577,
        -0.09634448, -0.09016494, -0.08868647, -0.08328115, -0.08035267]),
 'index_syn': array([ 48,   1,  56,  92,   6,  55,  82,   2,  63, 116], dtype=int64),
 'bootsig_syn': array([[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]])}
```
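To get a first feel for these fields, here is a minimal sketch that loads a saved Oinfo result and prints the top multiplets. The pickle path is hypothetical; for dOinfo, remember the extra nested key per target variable.

```python
import pickle

with open("output/Oinfo_output.pkl", "rb") as f:  # hypothetical path
    out = pickle.load(f)

# Top-n_best redundancy and synergy values with their combination indices
for rank, (val, idx) in enumerate(zip(out["sorted_red"], out["index_red"]), 1):
    print(f"redundancy #{rank}: {val:+.4f} (combination index {idx})")
for rank, (val, idx) in enumerate(zip(out["sorted_syn"], out["index_syn"]), 1):
    print(f"synergy    #{rank}: {val:+.4f} (combination index {idx})")
```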
But please go through `read_outputs.py` to learn how to make sense of it! Unfortunately, including all of that in this README would make it very lengthy :)
This work was done by Pranav Mahajan as a part of GSoC 2021 at INCF, under the guidance and mentorship of Daniele Marinazzo and Fernando Rosas.
A speed discrepancy between the estimator implementations was investigated (see `check_gcmi_timing.py` for more). We arrived at the conclusion that the discrepancy is not in the bootstrapping, as a small discrepancy remained even after turning off the bootci. The covariance-matrix-based linear estimator was implemented in the hope of resolving this issue, but it does not seem to provide a substantial speedup. Other approaches tried include Numba; future work could also try multi-threading for speedup. A weekly progress log of the GSoC project can be found in this document.