Tchanders / NetworkInference.jl

Methods for inferring undirected networks from data

NetworkInference

Description

NetworkInference is a package for inferring (undirected) networks, given a set of measurements for each node. The main output is the InferredNetwork type, which represents a fully connected, weighted network, where an edge's weight indicates the relative confidence of that edge existing in the true network. See also Scope.

Some things to note:

Installation

using Pkg
Pkg.add("NetworkInference")

Basic usage

First load the package at the start of your script or interactive session:

using NetworkInference

One step

Given a data file and an inference algorithm, you can infer a network with a single function call:

infer_network(<path to data file>, PIDCNetworkInference())

This returns an InferredNetwork. You can also write the inferred network to a file by passing the out_file_path keyword argument. See also Options.

Multiple steps

First make an array of Nodes from your data:

nodes = get_nodes(<path to data file>)

Currently the package assumes the file is of the format:

Then infer a network:

inferred_network = InferredNetwork(PIDCNetworkInference(), nodes)

An InferredNetwork has an array of nodes and an array of edges between all possible node pairs (sorted in descending order of edge weight, i.e. confidence of the edge existing in the true network).
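To make that layout concrete, here is a minimal, self-contained sketch. It does not use the package's own types: ToyEdge, the labels and the weights are hypothetical stand-ins, chosen only to show one edge per node pair, sorted by descending weight.

```julia
# Stand-in edge type for illustration only; the package defines its own types.
struct ToyEdge
    nodes::Tuple{String,String}
    weight::Float64
end

labels = ["A", "B", "C"]

# Hypothetical confidences for each unordered pair of nodes.
weights = Dict(("A", "B") => 0.8, ("A", "C") => 0.1, ("B", "C") => 0.5)

# One edge for every possible node pair, i.e. a fully connected network.
edges = [ToyEdge((labels[i], labels[j]), weights[(labels[i], labels[j])])
         for i in 1:length(labels) for j in (i + 1):length(labels)]

# Sort in descending order of weight, so the most confident edges come first.
sort!(edges, by = e -> e.weight, rev = true)

for e in edges
    println(e.nodes[1], "\t", e.nodes[2], "\t", e.weight)
end
```

With n nodes there are n(n-1)/2 such edges, so the edge list grows quadratically with the number of nodes.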

You can write the network to a file:

write_network_file(<path to output file>, inferred_network)

Options

The following keyword arguments can be passed to infer_network:

delim (Union{Char,Bool}) Column delimiter

discretizer (String) Method for discretizing

estimator (String) Method for estimating the probability distribution

number_of_bins (Integer)

base (Number) Base of the logarithm, i.e. the units for entropy

out_file_path (String) Path to the output network file

The defaults for discretizer and estimator are explained in [1].

Scope

This package is not designed for analysing networks/graphs or calculating network/graph metrics. Use another package for such analyses (e.g. LightGraphs). The edge list or the InferredNetwork will need to be parsed into the appropriate data structure first; the method get_adjacency_matrix may help with this.

Note that the InferredNetwork type contains a list of every possible edge, and the confidence of each edge existing in the true network. For analysing the properties of an inferred network, you may first want to define a partially connected, unweighted network by classifying each edge as "in the network" or "not in the network", based on the confidences. The simplest ways to do this are either to decide that the top x percent of edges are "in the network", or to define a threshold confidence, above which edges are "in the network".
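The two classification rules can be sketched in plain Julia as follows. This is an illustration only, not the package's implementation: the edge list, top_fraction and above_threshold are hypothetical names, and edges are represented as simple (node, node, confidence) tuples.

```julia
# Hypothetical edge list: (node, node, confidence).
edges = [("A", "B", 0.9), ("B", "C", 0.6), ("A", "C", 0.4),
         ("C", "D", 0.2), ("A", "D", 0.05), ("B", "D", 0.01)]

# Rule 1: keep the given fraction of edges with the highest confidence.
function top_fraction(edges, fraction)
    sorted = sort(edges, by = e -> e[3], rev = true)
    n_keep = round(Int, fraction * length(sorted))
    return sorted[1:n_keep]
end

# Rule 2: keep every edge whose confidence is at or above a threshold.
above_threshold(edges, threshold) = filter(e -> e[3] >= threshold, edges)

top_half = top_fraction(edges, 0.5)      # the 3 most confident edges
confident = above_threshold(edges, 0.5)  # edges with confidence >= 0.5

println(top_half)
println(confident)
```

The first rule fixes the number of edges kept regardless of their confidences; the second lets the number of edges vary with the data.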

You can pass a threshold into get_adjacency_matrix to get the adjacency matrix of a thresholded network (as well as dictionaries to map the node labels to their numerical IDs within the matrix, and vice versa):

get_adjacency_matrix(inferred_network, 0.1) # Keeps the 10% of edges with the largest weights

get_adjacency_matrix(inferred_network, 0.1, absolute = true) # Keeps all edges with weights >= 0.1
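As a rough picture of what an adjacency matrix with label/ID dictionaries looks like, here is a standalone sketch built from a hypothetical list of edges that survived thresholding. The variable names and the Bool matrix are assumptions made for illustration; the exact return values of get_adjacency_matrix are not shown here.

```julia
# Hypothetical inputs: all node labels, and the edges kept after thresholding.
node_labels = ["A", "B", "C", "D"]
kept_edges = [("A", "B"), ("B", "C")]

# Dictionaries mapping labels to numerical IDs (matrix indices) and back.
labels_to_ids = Dict(label => id for (id, label) in enumerate(node_labels))
ids_to_labels = Dict(id => label for (id, label) in enumerate(node_labels))

# Symmetric matrix for an undirected, unweighted network.
n = length(node_labels)
adjacency = falses(n, n)
for (a, b) in kept_edges
    i, j = labels_to_ids[a], labels_to_ids[b]
    adjacency[i, j] = true
    adjacency[j, i] = true
end

println(adjacency)
println(ids_to_labels[2])
```

A matrix of this shape is a common starting point for constructing a graph in a dedicated graph-analysis package.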

Performance

It may be possible to speed up an analysis, particularly for large datasets, by using multiple processes.

If multiple processes are available, NetworkInference will distribute the most costly calculations across the processes. (These are the for loops in get_mi_scores and get_puc_scores.)

Example

$ ./julia -p 3

julia> using NetworkInference

julia> infer_network(<path to data file>, PIDCNetworkInference())

This opens the Julia REPL with 3 extra processes (so 4 in total). NetworkInference may then be used as normal; it will handle distributing the calculations.

Note that the performance gain from distributing calculations is offset by the cost of communication between the processes, so for small datasets it is more efficient to use one process. For the same reason, using too many processes will degrade performance, so it is a good idea to run some timing tests with different numbers of processes.
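One way to run such timing tests is sketched below, using Julia's Distributed standard library. The workload is a hypothetical stand-in (run_workload is not part of this package); in practice you would time your own infer_network call instead.

```julia
using Distributed

# Stand-in workload (hypothetical): 32 independent tasks, mapped over
# whatever processes are currently available.
run_workload() = pmap(n -> sum(sqrt, 1:n), fill(500_000, 32))

timings = Dict{Int,Float64}()
for extra in (0, 1, 3)
    # Add the requested number of extra worker processes, if any.
    pids = extra > 0 ? addprocs(extra) : Int[]
    run_workload()                            # warm-up, so compilation is not timed
    timings[extra] = @elapsed run_workload()  # wall-clock time in seconds
    isempty(pids) || rmprocs(pids)            # remove the workers again
end

for extra in sort(collect(keys(timings)))
    println(extra + 1, " process(es): ", round(timings[extra], digits = 3), " s")
end
```

The timings depend on the machine and on the size of each task, so they are only meaningful when measured on your own data and hardware.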

Contributing

Bug reports, pull requests and other contributions are welcome!

References

[1] Chan, Stumpf and Babtie (2017). Gene Regulatory Network Inference from Single-Cell Data Using Multivariate Information Measures. Cell Systems.