Toolbox is a repository encapsulating various scripts used in my research on the analysis of disease- and drug-related biological data sets. It contains generic utilities for data processing (e.g., parsing, network-based analysis, and proximity calculation).
Contents
The code here has been developed during the analysis of data in various projects such as
The package mainly consists of two types of files:
For instance, parse_drugbank.py contains methods to parse the XML dump of the DrugBank database (v.3), and network_utilities.py contains methods related to network generation and analysis.
Parsers are available for the APIs / files provided in the following resources (note that each parser is specific to retrieving a certain type of information, often related to pharmacological analyses, and might not be up to date):
The parsers are provided "as is" and might stop working when the data formats of these resources change. Please contact me for suggestions, bug reports and enquiries.
Some functions in toolbox rely on the following packages. If they are not installed, the package will still load properly, but certain functionality might not be available.
wrappers.py provides an easy-to-use interface to various methods I commonly use. It is continuously under development. Currently it contains methods to
The example below uses the Python interface to run GUILD with A and C as seeds on a toy network. It assumes that GUILD is properly compiled and that executable_path is set to the location of the GUILD executable:
>>> from toolbox import wrappers
>>> file_name = "toy.sif"
>>> network = wrappers.get_network(file_name, only_lcc = True)
>>> nodes = set(network.nodes())
>>> seeds = ["A", "C"]
>>> node_to_score = dict((node, 1) for node in seeds)
>>> name = "sample_run"
>>> output_dir = "./"
>>> wrappers.run_guild(name, node_to_score, nodes, file_name, output_dir, executable_path)
After this command, the input node score file "sample_run.node" and the output node score file "sample_run.ns" will be created in the current directory.
To replicate the analysis in the paper, please refer to the proximity repository.
See the calculate_proximity method in wrappers.py for calculating proximity:
calculate_proximity(network, nodes_from, nodes_to, nodes_from_random=None, nodes_to_random=None, n_random=1000, min_bin_size=100, seed=452456)
For instance, to calculate the proximity from (A, C) to (B, D, E) in a toy network (given below), you can use the following code. Note that the default proximity calculation uses the "closest" measure and calculates the shortest paths on the fly. If a different measure (such as "shortest") is used, the all-pairs shortest paths are calculated first and stored in a pickled file with the "temp_" prefix in the working path. If you would like to use a pre-defined shortest path length dictionary in the default version (with the "closest" measure), the dictionary can be provided via the "lengths" parameter.
>>> from toolbox import wrappers
>>> file_name = "toy.sif"
>>> network = wrappers.get_network(file_name, only_lcc = True)
Shrinking network to its LCC 11 15
Final shape: 11 15
>>> nodes_from = ["A", "C"]
>>> nodes_to = ["B", "D", "E"]
>>> # Calculate proximity using default measure ("closest")
>>> d, z, (mean, sd) = wrappers.calculate_proximity(network, nodes_from, nodes_to, min_bin_size = 2, seed=452456)
>>> print (d, z, (mean, sd))
(1.0, 1.3870748387117167, (0.671, 0.2371897974197035))
>>> # Calculate proximity using "shortest" measure, all pair shortest path lengths are stored in a temp file
>>> d, z, (mean, sd) = wrappers.calculate_proximity(network, nodes_from, nodes_to, min_bin_size = 2, seed=452456, distance="shortest")
>>> print (d, z, (mean, sd))
(1.3333333333333335, 0.43423257023103884, (1.2721666666666667, 0.14086153563773976))
Toy network (toy.sif):
A 1 B
A 1 C
A 1 D
A 1 E
A 1 F
A 1 G
A 1 H
B 1 C
B 1 D
B 1 I
B 1 J
C 1 K
D 1 E
D 1 I
E 1 F
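The examples above read the network from toy.sif in the working directory. As a convenience, here is a minimal sketch that writes the listing above to disk. The names TOY_EDGES, sif_lines and write_sif are hypothetical helpers introduced for illustration and are not part of toolbox; the "node weight node" layout is simply copied from the listing.

```python
# Edges of the toy network listed above, as (node, weight, node) triples.
TOY_EDGES = [
    ("A", 1, "B"), ("A", 1, "C"), ("A", 1, "D"), ("A", 1, "E"),
    ("A", 1, "F"), ("A", 1, "G"), ("A", 1, "H"),
    ("B", 1, "C"), ("B", 1, "D"), ("B", 1, "I"), ("B", 1, "J"),
    ("C", 1, "K"),
    ("D", 1, "E"), ("D", 1, "I"),
    ("E", 1, "F"),
]


def sif_lines(edges):
    """Format each edge as one space-separated 'node weight node' line."""
    return ["%s %s %s" % edge for edge in edges]


def write_sif(edges, path):
    """Write the edges to path, one edge per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(sif_lines(edges)) + "\n")


if __name__ == "__main__":
    # Creates toy.sif in the current directory (hypothetical helper usage).
    write_sif(TOY_EDGES, "toy.sif")
```

The resulting file has 15 edges over 11 nodes, which agrees with the "Final shape: 11 15" message printed by get_network above.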
The inputs are the two groups of nodes and the network.
The proximity is not symmetric (if nodes_from and nodes_to are swapped,
the results would be different, see below for details).
The nodes in the network are binned such that the nodes in the same bin have similar degrees.
For real networks, use a larger min_bin_size
(e.g., 10, 25, 50, 100, see below for choosing the bin size).
The random nodes matching the number and the degree of the nodes in the node sets are chosen
using these bins.
The average distance from the nodes in one set to the other is then calculated and compared to the
random expectation (the distances observed in random groups).
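To make the distance and the comparison to random expectation concrete, here is a small standalone sketch in plain Python. It is not the code in wrappers.py: the function names are hypothetical, edge weights are ignored, and the definitions of "closest" (mean of the minimum distance from each source node to the target set) and "shortest" (mean of the average distance from each source node to the target set) are assumptions that happen to reproduce the d values shown in the example output above.

```python
from collections import deque

# Toy network from this README as unweighted, undirected edges.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
    ("A", "G"), ("A", "H"), ("B", "C"), ("B", "D"), ("B", "I"),
    ("B", "J"), ("C", "K"), ("D", "E"), ("D", "I"), ("E", "F"),
]


def build_adjacency(edges):
    """Map every node to the set of its neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to each reachable node."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in lengths:
                lengths[neighbor] = lengths[node] + 1
                queue.append(neighbor)
    return lengths


def distance(adjacency, nodes_from, nodes_to, measure="closest"):
    """Average distance from nodes_from to nodes_to (assumed definitions)."""
    per_source = []
    for source in nodes_from:
        lengths = shortest_path_lengths(adjacency, source)
        to_targets = [lengths[target] for target in nodes_to]
        if measure == "closest":
            per_source.append(min(to_targets))
        elif measure == "shortest":
            per_source.append(sum(to_targets) / len(to_targets))
        else:
            raise ValueError("unknown measure: %s" % measure)
    return sum(per_source) / len(per_source)


def z_score(observed, random_mean, random_sd):
    """Compare the observed distance with the random expectation."""
    return (observed - random_mean) / random_sd


adjacency = build_adjacency(EDGES)
d_closest = distance(adjacency, ["A", "C"], ["B", "D", "E"])
d_shortest = distance(adjacency, ["A", "C"], ["B", "D", "E"], "shortest")
# Swapping the two sets can change the result: 2.0 versus 2.5 here.
d_forward = distance(adjacency, ["K"], ["A", "G"])
d_backward = distance(adjacency, ["A", "G"], ["K"])
```

Plugging the values printed in the example above into z_score, that is d = 1.0 with a random mean of 0.671 and a standard deviation of about 0.2372, gives z of about 1.387, which matches the reported z.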
From/to nodes: Note that proximity, by definition, is not symmetric and the order of nodes_from and nodes_to makes a difference. If you do not have an intrinsic relationship between
the two sets of nodes (the proximity from one to the other, such as from drugs to diseases),
you can use the node set smaller in size.
Degree binning: For random selection of nodes with similar degrees to those in the original node sets,
proximity bins nodes of similar degree. This is because high-degree nodes in the network are
less common, and without binning the randomization algorithm would always choose the same set of nodes.
That being said, if the bins are too large, the nodes within a bin are not a good representation of
the original nodes (the bin spans many nodes with different degrees). Accordingly, the bins should contain
enough nodes to allow a representative random sampling. In each bin, the nodes with the next
higher degree (e.g., k+1, if k is the degree of the nodes in the current bin) are added iteratively
until min_bin_size is reached. For instance, min_bin_size can be chosen to be at least twice
as large as the max number of nodes in the node sets.
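The binning described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation in toolbox: degree_bins and pick_degree_matched are hypothetical names, a leftover bin that stays below min_bin_size is merged into the previous bin, and random nodes are drawn without repetition. The degrees are those of the toy network in this README.

```python
import random
from collections import defaultdict

# Node degrees of the toy network (toy.sif) in this README.
TOY_DEGREES = {
    "A": 7, "B": 5, "C": 3, "D": 4, "E": 3, "F": 2,
    "G": 1, "H": 1, "I": 2, "J": 1, "K": 1,
}


def degree_bins(degrees, min_bin_size):
    """Group nodes into (low, high, nodes) bins of at least min_bin_size."""
    by_degree = defaultdict(list)
    for node, degree in degrees.items():
        by_degree[degree].append(node)
    bins = []
    current, low = [], None
    for degree in sorted(by_degree):
        if low is None:
            low = degree
        # Add the nodes with the next higher degree to the open bin.
        current.extend(by_degree[degree])
        if len(current) >= min_bin_size:
            bins.append((low, degree, current))
            current, low = [], None
    if current:
        highest = max(by_degree)
        if bins:
            # Assumption: a too-small last bin joins the previous one.
            prev_low, _, prev_nodes = bins[-1]
            bins[-1] = (prev_low, highest, prev_nodes + current)
        else:
            bins.append((low, highest, current))
    return bins


def pick_degree_matched(nodes, degrees, bins, rng):
    """For each node, draw a not-yet-chosen node from its degree bin."""
    chosen = []
    for node in nodes:
        for low, high, members in bins:
            if low <= degrees[node] <= high:
                candidates = [m for m in members if m not in chosen]
                # Raises IndexError if the bin is exhausted, which is why
                # bins should be larger than the node sets.
                chosen.append(rng.choice(candidates))
                break
    return chosen


bins = degree_bins(TOY_DEGREES, min_bin_size=2)
random_nodes = pick_degree_matched(["A", "C"], TOY_DEGREES, bins,
                                   random.Random(452456))
```

With min_bin_size=2, the toy network yields four bins covering degrees 1, 2, 3 and 4 to 7. The single degree-7 node (A) cannot form a bin on its own, so it ends up sharing a bin with B and D, which is the kind of pooling that gives the randomization more than one candidate for a hub.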
If you use biomedical database parsers or proximity related methods please cite: Guney E, Menche J, Vidal M, Barabási AL. Network-based in silico drug efficacy screening. Nat. Commun. 7:10331 doi: 10.1038/ncomms10331 (2016).
If you use GUILD related methods please cite: Guney E, Oliva B. Exploiting Protein-Protein Interaction Networks for Genome-Wide Disease-Gene Prioritization. PLoS ONE 7(9): e43557 (2012).