churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning
https://churchman.med.harvard.edu/genewalk
BSD 2-Clause "Simplified" License

GeneWalk


GeneWalk determines for individual genes the functions that are relevant in a particular biological context and experimental condition. GeneWalk quantifies the similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined through comparison with node similarities from randomized networks.

Install GeneWalk

To install the latest release of GeneWalk (preferred):

pip install genewalk

To install the latest code from GitHub (typically ahead of releases):

pip install git+https://github.com/churchmanlab/genewalk.git

GeneWalk uses a number of resource files that it downloads as needed at runtime. To optionally pre-download these files into the default resource folder, run:

python -m genewalk.resources

Using GeneWalk

Gene list file

GeneWalk always requires as input a text file containing a list of genes of interest relevant to the biological context, for example differentially expressed genes from a sequencing experiment that compares an experimental condition with a control. GeneWalk supports gene list files containing HGNC human gene symbols, HGNC IDs, human Ensembl gene IDs, MGI mouse gene IDs, RGD rat gene IDs, or human or mouse Entrez IDs. GeneWalk internally maps these IDs to human genes.

For organisms other than human, mouse or rat, there are two options. The first is to map the genes to human orthologs yourself and then input the human ortholog list as described above. Use this strategy if you consider the organism sufficiently related to human. The second option is to provide an input gene file with custom gene IDs. These are not mapped to human genes. Use custom gene IDs for more divergent organisms, such as Drosophila, worm, yeast, plants or bacteria. In this case you must also provide a custom gene network with GO annotations as input. See the section Custom input networks for more details.

Each line in the gene input file contains a gene identifier of one of the above types.
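
As a minimal sketch of preparing such a file in Python: the helper name write_gene_list and the example symbols below are illustrative and not part of GeneWalk. The helper writes one identifier per line; dropping blank and repeated entries is a choice made in this sketch, not a documented GeneWalk requirement.

```python
from pathlib import Path


def write_gene_list(gene_ids, path):
    """Write one gene identifier per line (hypothetical helper, not part of GeneWalk)."""
    seen = set()
    cleaned = []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        # Skip blank entries and repeated identifiers, keeping first-seen order.
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned


if __name__ == "__main__":
    # Example HGNC symbols chosen for illustration only.
    write_gene_list(["MYC", "TP53", "MYC", " BRCA1 "], "gene_list.txt")
```

The resulting gene_list.txt can then be passed to the --genes argument with --id_type hgnc_symbol.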

GeneWalk command line interface

Once installed, GeneWalk can be run from the command line as genewalk, with a set of required and optional arguments. The required arguments include the project name, a path to a text file containing a list of genes, and an argument specifying the type of gene identifiers in the file.

Example

genewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol
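
When GeneWalk is launched from a script rather than typed by hand, the same command can be assembled in Python. This is a sketch only: genewalk_command and run_genewalk are hypothetical helper names, and the only flags used are the ones shown in the example above plus the optional nproc argument.

```python
import shlex
import subprocess

# Identifier types listed for the --id_type option.
ID_TYPES = {"hgnc_symbol", "hgnc_id", "ensembl_id", "mgi_id", "rgd_id",
            "entrez_human", "entrez_mouse", "custom"}


def genewalk_command(project, genes, id_type, nproc=None):
    """Build the genewalk argument list (hypothetical helper, not part of GeneWalk)."""
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported id_type: {id_type!r}")
    cmd = ["genewalk", "--project", project, "--genes", str(genes),
           "--id_type", id_type]
    if nproc is not None:
        cmd += ["--nproc", str(nproc)]
    return cmd


def run_genewalk(cmd, dry_run=True):
    """Print the command; launch it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        # Requires GeneWalk to be installed and on the PATH.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_genewalk(genewalk_command("context1", "gene_list.txt", "hgnc_symbol"))
```

By default the sketch only prints the command; pass dry_run=False to start the run.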

Below is the full documentation of the command line interface:

genewalk [-h] [--version] --project PROJECT --genes GENES --id_type
              {hgnc_symbol,hgnc_id,ensembl_id,mgi_id,rgd_id,entrez_human,entrez_mouse,custom}
              [--stage {all,node_vectors,null_distribution,statistics,visual}]
              [--base_folder BASE_FOLDER]
              [--network_source {pc,indra,edge_list,sif,sif_annot,sif_full}]
              [--network_file NETWORK_FILE] [--nproc NPROC]
              [--nreps_graph NREPS_GRAPH] [--nreps_null NREPS_NULL]
              [--alpha_fdr ALPHA_FDR] [--dim_rep DIM_REP] [--save_dw SAVE_DW]
              [--random_seed RANDOM_SEED]

required arguments:
  --version             Print the version of GeneWalk and exit.
  --project PROJECT     A name for the project which determines the folder
                        within the base folder in which the intermediate and
                        final results are written. Must contain only
                        characters that are valid in folder names.
  --genes GENES         Path to a text file with a list of differentially
                        expressed genes. The type of gene identifiers used in
                        the text file is provided in the id_type argument.
  --id_type {hgnc_symbol,hgnc_id,ensembl_id,mgi_id,rgd_id,entrez_human,entrez_mouse,custom}
                        The type of gene IDs provided in the text file in the
                        genes argument. Possible values are: hgnc_symbol,
                        hgnc_id, ensembl_id, mgi_id, rgd_id, entrez_human,
                        entrez_mouse, and custom. If custom, a network_source
                        of sif_annot or sif_full must be used.

optional arguments:
  --stage {all,node_vectors,null_distribution,statistics,visual}
                        The stage of processing to run. Default: all
  --base_folder BASE_FOLDER
                        The base folder used to store GeneWalk temporary and
                        result files for a given project. Default:
                        ~/genewalk
  --network_source {pc,indra,edge_list,sif,sif_annot,sif_full}
                        The source of the network to be used. Possible values
                        are: pc, indra, edge_list, sif, sif_annot, and
                        sif_full. In case of indra, edge_list, sif, sif_annot,
                        and sif_full, the network_file argument must be
                        specified. Default: pc
  --network_file NETWORK_FILE
                        If network_source is indra, this argument points to a
                        Python pickle file in which a list of INDRA Statements
                        constituting the network is contained. In case
                        network_source is edge_list, sif, sif_annot, or
                        sif_full, the network_file argument points to a text
                        file representing the network. See README section
                        Custom input networks for full description of file
                        format requirements.
  --nproc NPROC         The number of processors to use in a multiprocessing
                        environment. Default: 1
  --nreps_graph NREPS_GRAPH
                        The number of repeats to run when calculating node
                        vectors on the GeneWalk graph. Default: 3
  --nreps_null NREPS_NULL
                        The number of repeats to run when calculating node
                        vectors on the random network graphs for constructing
                        the null distribution. Default: 3
  --alpha_fdr ALPHA_FDR
                        The false discovery rate to use when outputting the
                        final statistics table. If 1 (default), all
                        similarities are output, otherwise only the ones whose
                        false discovery rate is below this parameter are
                        included. Default: 1 
                        For visualization a default value of 0.1 for both global
                        and gene-specific plots is used. Lower this value to 
                        increase the stringency of the regulator gene selection 
                        procedure.
  --dim_rep DIM_REP     Dimension of vector representations (embeddings). This 
                        value should only be increased if genewalk with the 
                        default value generates no statistically significant 
                        results, for instance with very large (>2500) input 
                        gene lists. Alternatively, it can be decreased in case 
                        (nearly) all GO annotations are significant, for 
                        instance with very short gene lists. Default: 8
  --save_dw SAVE_DW     If True, the full DeepWalk object for each repeat is
                        saved in the project folder. This can be useful for
                        debugging but the files are typically very large.
                        Default: False
  --random_seed RANDOM_SEED
                        If provided, the random number generator is seeded
                        with the given value. This should only be used if the
                        goal is to deterministically reproduce a prior result
                        obtained with the same random seed.

Output files

GeneWalk automatically creates a genewalk folder in the user's home folder (or in the user-specified base_folder). One of the required inputs when running GeneWalk is a project name. A sub-folder is created for that project name, and all intermediate and final results are stored there. The files stored in the project folder are:

Figure files

GeneWalk also automatically generates figures to visualize its results in the project/figures sub-folder:

GeneWalk results file description

genewalk_results.csv is the main GeneWalk output table, a comma-separated values text file with the following column headers:

Run time and stages of GeneWalk algorithm

For a short run time (1-2 h), the recommended number of processors (optional argument: nproc) is 4:

genewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol --nproc 4

By default GeneWalk runs with 1 processor, resulting in a longer overall run time of 6-12 h. Given a list of genes, GeneWalk runs four stages of analysis:

  1. Assembling a GeneWalk network and learning node vector representations by running DeepWalk on this network, for a specified number of repeats. Typical run time: one to a few hours.
  2. Learning random node vector representations by running DeepWalk on a set of randomized versions of the GeneWalk network, for a specified number of repeats. Typical run time: one to a few hours.
  3. Calculating statistics of similarities between genes and GO terms, and outputting the GeneWalk results in a table. Typical run time: a few minutes.
  4. Visualization of the GeneWalk results generated in the project/figures subfolder. Typical run time: 1-10 mins depending on the number of input genes.

GeneWalk can either be run once to complete all these stages (default) or be called separately for each stage (optional argument: stage). Recommended available memory: 16 GB or 32 GB of RAM.

GeneWalk outputs the uncertainty (95% confidence intervals) of the similarity significance (global and gene p-adjust). Depending on the context-specific network topology, this uncertainty can be large for individual gene-function associations. If the uncertainties are very large overall, set the optional arguments nreps_graph to 10 (or more) and nreps_null to 10 to increase the algorithm's precision, at the cost of a longer run time.
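
The stage-by-stage mode with a higher number of repeats can be sketched as a list of commands, one per stage. The helper staged_commands is hypothetical, and passing the same nproc, nreps_graph and nreps_null values to every stage (so that all stages agree on the number of repeats) is an assumption of this sketch rather than a documented requirement.

```python
import shlex

# Stage names accepted by --stage, in the order the stages are described above.
STAGES = ["node_vectors", "null_distribution", "statistics", "visual"]


def staged_commands(project, genes, id_type, nproc=4, nreps_graph=10, nreps_null=10):
    """Return one genewalk command per stage (hypothetical helper)."""
    base = [
        "genewalk",
        "--project", project,
        "--genes", str(genes),
        "--id_type", id_type,
        "--nproc", str(nproc),
        "--nreps_graph", str(nreps_graph),
        "--nreps_null", str(nreps_null),
    ]
    # Each command differs only in the --stage value appended at the end.
    return [base + ["--stage", stage] for stage in STAGES]


if __name__ == "__main__":
    for cmd in staged_commands("context1", "gene_list.txt", "hgnc_symbol"):
        print(shlex.join(cmd))
```

Running the printed commands in order should be equivalent to a single run with the default stage all, while making it possible to rerun a single stage, for example only the visualization.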

Custom input networks

By default, GeneWalk uses the PathwayCommons resource (--network_source pc) to create a human gene network. It then automatically adds edges representing GO annotations for input genes and ontology relations between GO terms. However, there are options to run GeneWalk with a custom network as an input.

First, specify the --network_source argument as one of the alternative sources: {indra, edge_list, sif, sif_annot, sif_full}.

If the input gene list uses custom gene IDs (--id_type custom), for instance from a model organism, choose sif_annot or sif_full as the network source.

Then, include the argument --network_file with the path to the custom network input file. The network file format has to correspond to the chosen --network_source, as follows.

The sif/sif_annot/sif_full options require the network file in a simple interaction file (SIF) format. Each row of the SIF text file consists of three comma-separated entries representing source, relation type, and target. The relation type is not explicitly used by GeneWalk, and can be set to an arbitrary label.
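
A minimal sketch of writing a network in this three-column, comma-separated layout with Python's csv module follows. The helper name write_sif, the node identifiers and the relation label are placeholders for illustration; as noted above, the relation label itself is arbitrary.

```python
import csv


def write_sif(edges, path):
    """Write (source, relation, target) triples as comma-separated rows.

    Hypothetical helper, not part of GeneWalk.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for source, relation, target in edges:
            writer.writerow([source, relation, target])


if __name__ == "__main__":
    # Placeholder identifiers and relation label, for illustration only.
    edges = [
        ("geneA", "interacts_with", "geneB"),
        ("geneB", "interacts_with", "geneC"),
    ]
    write_sif(edges, "custom_network.sif")
```

The resulting file can be passed through --network_file together with the matching --network_source value.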

The difference between the sif, sif_annot, and sif_full options:

The edge_list option is a simplified version of the sif option. It requires a network text file that contains rows with two columns each, a source and a target. In other words, it omits the relation type column from the SIF format. Further file preparation requirements are the same as for the sif option.
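
An existing SIF file can be reduced to this two-column form by dropping the relation column. The sketch below uses a hypothetical helper, sif_to_edge_list, and keeps the comma delimiter on the assumption that the edge list follows the same conventions as the SIF file.

```python
import csv


def sif_to_edge_list(sif_path, edge_list_path):
    """Copy source and target columns from a SIF file, dropping the relation.

    Hypothetical helper, not part of GeneWalk. Returns the number of edges written.
    """
    n_rows = 0
    with open(sif_path, newline="") as src, \
            open(edge_list_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            if len(row) != 3:
                raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
            source, _relation, target = row
            writer.writerow([source, target])
            n_rows += 1
    return n_rows
```

For example, sif_to_edge_list("custom_network.sif", "custom_network_edges.txt") writes one source,target row per SIF row; the output file would then be used with --network_source edge_list.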

The indra option requires a Python pickle file containing a list of INDRA Statements as the custom network input file. These statements can represent human gene-gene relations as well as gene-GO relations, from which network edges are derived. GeneWalk then automatically adds human GO annotations and ontology relations between GO terms during network construction.

Further documentation

For a tutorial and more general information see the GeneWalk website.
For further code documentation see our readthedocs page.

Citation

Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and L. Stirling Churchman
GeneWalk identifies relevant gene functions for a biological context using network representation learning,
Genome Biology 22, 55 (2021). https://doi.org/10.1186/s13059-021-02264-8

Funding

This work was supported by National Institutes of Health grant 5R01HG007173-07 (L.S.C.), EMBO fellowship ALTF 2016-422 (R.I.), and DARPA grants W911NF-15-1-0544 and W911NF018-1-0124 (P.K.S.).