Coinfinder

A tool for the identification of coincident (associating and dissociating) genes in pangenomes.

Written in collaboration with Martin Rusilowicz. Coinfinder was developed in the McInerney laboratory.

What is it?

Coinfinder (pronounced "coin-finder") is an algorithm and software tool that detects genes which associate and dissociate with other genes more often than expected by chance in pangenomes. Coinfinder is written primarily in C++ and is a command line tool which generates text, gexf, and pdf outputs for the user.

Coinfinder uses a Bonferroni-corrected Binomial exact test statistic of the expected and observed rates of gene-gene association to evaluate whether a given gene pair is coincident.
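To make the statistic above concrete, here is a minimal Python sketch of a Bonferroni-corrected, one-sided Binomial exact test for a single gene pair. It is an illustration only, not Coinfinder's C++ implementation: the function names are hypothetical, and the expected co-occurrence rate is assumed to be the product of the two genes' individual frequencies.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def association_p_value(n_genomes: int, n_a: int, n_b: int,
                        n_both: int, n_tests: int) -> float:
    """Bonferroni-corrected p-value that genes A and B co-occur more than expected.

    n_a, n_b: number of genomes carrying gene A / gene B
    n_both:   number of genomes carrying both genes (observed)
    n_tests:  number of gene pairs tested (Bonferroni multiplier)
    """
    # Assumption: under independence, the expected co-occurrence rate is
    # the product of the two marginal frequencies.
    p_expected = (n_a / n_genomes) * (n_b / n_genomes)
    raw = binomial_upper_tail(n_both, n_genomes, p_expected)
    # Bonferroni correction: scale by the number of tests, capped at 1.
    return min(1.0, raw * n_tests)
```

A pair would be called associating when the corrected p-value falls below the chosen significance level (0.05 by default, per the -L option).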

When and why should I use it?

Coinfinder is designed to take as input a dataset of pangenomes and their genes. Ideally, genes will be clustered into homologous gene clusters using a pangenomic tool such as Panaroo, Roary, PIRATE, or Pandora. Coinfinder should be used to identify coincident gene sets within a given pangenomic dataset. Coinfinder was written to identify coincident genes among strains of prokaryote species (i.e. a species pangenome) but can be extended to other pangenomic datasets.

Where can I read more about it?

Fiona J. Whelan, Martin Rusilowicz, & James O. McInerney. "Coinfinder: detecting significant associations and dissociations in pangenomes." doi: https://doi.org/10.1099/mgen.0.000338

Installation:

Coinfinder is available on Linux or macOS; it has not been developed for Windows.

Bioconda

If you use Conda: conda install -c defaults -c bioconda -c conda-forge coinfinder

If you use Mamba: mamba install -c defaults -c bioconda -c conda-forge coinfinder

(If the installation gets stuck on solving the environment, please see issue 36.)

Native install

Dependencies:

CMake 3.6 or greater https://cmake.org/download/ (brew install cmake on a Mac)
Python >3.6, <3.8 https://www.python.org/downloads/
Boost 1.66 or greater https://www.boost.org/users/download/ (brew install boost on a Mac)
OpenMP (brew install llvm on a Mac)
gcc 5 or greater (default on most new-ish machines)
R libraries: caper, phytools, getopt, igraph, dplyr, cowplot, data.table, ggraph, flock, future
Bioconductor R library: ggtree https://bioconductor.org/packages/release/bioc/html/ggtree.html

Installation:

cmake -DCMAKE_BUILD_TYPE=Release .
cmake --build .
./coinfinder

On macOS, the default compiler may be clang instead of g++. If so, you may need to point the build at gcc before running cmake; for example: export CC=/usr/local/bin/gcc-6 CXX=/usr/local/bin/g++-6 MPICXX=/usr/local/bin/mpic++

Methodology:

[Figure: overview of the Coinfinder methodology]

Usage:

coinfinder -i <gene information> [-I] -p <phylogeny> -o <output prefix> [--associate|--dissociate]

Coinfinder requires gene information and a phylogeny as input. The gene information can be provided in one of two formats: (a) as the gene_presence_absence.csv output from Roary; (b) as a tab-delimited list of genes present in each strain. An example of a tab-delimited list of genes:

gene_1  genome_1
gene_1  genome_2
gene_1  genome_3
gene_2  genome_2
gene_2  genome_3
gene_3  genome_1
gene_3  genome_2
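For readers who want to inspect or build such a file programmatically, the following Python sketch reads the tab-delimited format into a mapping from each gene to the set of genomes that carry it. The function name is hypothetical and is not part of Coinfinder; the sketch assumes exactly one gene and one genome per line, separated by a single TAB.

```python
import io
from collections import defaultdict


def read_gene_list(handle):
    """Parse 'gene<TAB>genome' lines into {gene: {genome, ...}}."""
    genomes_by_gene = defaultdict(set)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        gene, genome = line.split("\t")[:2]
        genomes_by_gene[gene].add(genome)
    return dict(genomes_by_gene)


# The example list above, written with explicit TAB characters.
example = (
    "gene_1\tgenome_1\n"
    "gene_1\tgenome_2\n"
    "gene_1\tgenome_3\n"
    "gene_2\tgenome_2\n"
    "gene_2\tgenome_3\n"
    "gene_3\tgenome_1\n"
    "gene_3\tgenome_2\n"
)
table = read_gene_list(io.StringIO(example))
```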

Note: the gene_presence_absence.csv output from Panaroo appears to differ from Roary in that fields are not surrounded by double-quotes. Coinfinder assumes this double-quote format; you could use something like the following to correct for this:

sed -e 's/^/"/g' -e 's/$/"/g' -e 's/,/","/g' gene_presence_absence.csv > gene_presence_absence-withquotes.csv
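If you would rather not rely on sed, the same quoting can be sketched with Python's standard csv module. The function name below is hypothetical; the sketch assumes the input is ordinary comma-separated text and simply rewrites every field with surrounding double quotes.

```python
import csv
import io


def quote_all_fields(src, dst):
    """Copy CSV rows from src to dst, wrapping every field in double quotes."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Small in-memory demonstration with made-up column names.
src = io.StringIO("Gene,Annotation,genome_1\ngroup_1,hypothetical protein,g1_0001\n")
dst = io.StringIO()
quote_all_fields(src, dst)
quoted = dst.getvalue()
```

For real files, open the input and output with open(path, newline="") and pass the two handles to the function.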

The phylogeny should be Newick-formatted with no zero-length branches. We suggest that this phylogeny be constructed using the core gene information (for example, as suggested in the Roary pipeline https://sanger-pathogens.github.io/Roary/).
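Because zero-length branches are easy to miss, a quick pre-flight check can help. The Python sketch below counts branch lengths equal to zero in a Newick string using a regular expression. It is a rough heuristic rather than a full Newick parser: it assumes unquoted labels that contain no colons, and the function name is hypothetical.

```python
import re

# A branch length is the number that follows a ':' in Newick notation.
BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")


def count_zero_length_branches(newick: str) -> int:
    """Return how many branch lengths in the Newick string are exactly zero."""
    return sum(
        1 for match in BRANCH_LENGTH.finditer(newick)
        if float(match.group(1)) == 0.0
    )


# Two of the four branches in this toy tree have zero length.
n_zero = count_zero_length_branches("((A:0.1,B:0.0):0.2,C:0);")
```

A non-zero count suggests the tree should be adjusted before it is passed to -p.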

Lastly, the user must decide between running Coinfinder to find associations (gene pairs present together) or dissociations (gene pairs which are present apart, or avoid each other).
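Putting the pieces of the usage line together, the sketch below assembles a Coinfinder command as an argument list in Python. It uses only the flags shown in the usage line (-i, -I, -p, -o, --associate, --dissociate); the function name and the file names in the example are hypothetical.

```python
def build_coinfinder_command(gene_file, phylogeny, prefix="coincident",
                             roary_format=False, mode="associate"):
    """Return the argv list for one Coinfinder run."""
    if mode not in ("associate", "dissociate"):
        raise ValueError("mode must be 'associate' or 'dissociate'")
    cmd = ["coinfinder", "-i", gene_file]
    if roary_format:
        cmd.append("-I")  # input is a Roary gene_presence_absence.csv
    cmd += ["-p", phylogeny, "-o", prefix, f"--{mode}"]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_coinfinder_command(
    "gene_presence_absence.csv", "core_genes.newick",
    prefix="my_run", roary_format=True, mode="associate",
)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where coinfinder is installed.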

For more information on usage, please see coinfinder -h:

File input- specify either: 
    -i or --input          The path to the gene_presence_absence.csv output from Roary
                           -or-
                           The path of the Alpha-to-Beta file with (alpha)(TAB)(beta)
    -I or --inputroary     Set if -i is in the gene_presence_absence.csv format from Roary
    -p or --phylogeny      Phylogeny of Betas in Newick format (required)
Max mode (mandatory for coincidence analysis):
    -a or --associate      Overlap; identify groups that tend to associate/co-occur.
    -d or --dissociate     Separation; identify groups that tend to dissociate/avoid.
Significance- specify: 
    -L or --level          Specify the significance level cutoff (default: 0.05)
Significance correction- specify: 
    -m or --bonferroni     Bonferroni multiple-testing correction (recommended)
    -n or --nocorrection   No correction, use value as-is
    -c or --fraction       (Connectivity analysis only) Use fraction rather than p-value
Alternative hypothesis- specify: 
    -g or --greater        Greater (recommended)
    -l or --less           Less
    -t or --twotailed      Two-tailed
Miscellaneous:
    -x or --num_cores      The number of cores to use (default: 2)
    -v or --verbose        Verbose output.
    -r or --filter         Permit filtering of saturated and low-abundance data.
    -U or --upfilthreshold Upper filter threshold for high-abundance data filtering (default: 1.0, i.e. any alpha in >=100% of betas).
    -F or --filthreshold   Threshold for low-abundance data filtering (default: 0.05, i.e. any alpha in <=5% of betas).
    -q or --query          The path to a file containing a list of genes to specifically query, one per line (optional).
    -T or --test           Runs the test cases and exits.
    -E or --all            Outputs all results, regardless of significance.
Output:
    -o or --output         The prefix of all output files (default: coincident).
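The filtering options (-r, -U, -F) are described in terms of the fraction of betas (genomes) in which an alpha (gene) occurs. The Python sketch below shows one reading of those thresholds, based only on the help text above; Coinfinder's actual filtering may differ in detail, and the function name is hypothetical.

```python
def filter_genes(genomes_by_gene, n_genomes, low=0.05, high=1.0):
    """Drop genes present in <= low or >= high fraction of genomes."""
    kept = {}
    for gene, genomes in genomes_by_gene.items():
        fraction = len(genomes) / n_genomes
        if fraction <= low or fraction >= high:
            continue  # low-abundance or saturated gene
        kept[gene] = genomes
    return kept


# Toy dataset of 40 genomes, identified by integers for brevity.
example = {
    "core": set(range(40)),       # present everywhere: saturated
    "rare": {0},                  # present in 2.5% of genomes
    "accessory": set(range(10)),  # present in 25% of genomes
}
kept = filter_genes(example, n_genomes=40)
```

With the default thresholds, only the "accessory" gene in this toy dataset passes the filter.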

To get the version of coinfinder, simply type coinfinder without any flag options.

Example output:

[Figure: example output. (A) association network; (B) presence/absence heatmap alongside the phylogeny]

An example association network in which each gene (node) is connected to another gene by a line (edge) if and only if the two genes statistically co-occur. Nodes are weighted by lineage-independence in the phylogeny (i.e. the larger the node, the more phylogenetically independent the gene). Nodes are coloured by connected component, that is, the set of genes with associative relationships with each other. These data can also be shown as a presence/absence heatmap in relation to the phylogeny (note: this heatmap is a subset of all results; in particular, the large wine-coloured gene set has been removed for ease of visibility). The association network displayed in part A was made by inputting the coinfinder output .gephi file into the Gephi software (https://gephi.org/). The heatmap displayed in part B is part of the coinfinder default output.

Example usage:

The example dataset, including the input and expected output files used in the associated manuscript, can be found here.

Citation information:

@article{mbs:/content/journal/mgen/10.1099/mgen.0.000338,
  author = "Whelan, Fiona Jane and Rusilowicz, Martin and McInerney, James Oscar",
  title = "Coinfinder: detecting significant associations and dissociations in pangenomes",
  year = "2020",
  publisher = "Microbiology Society",
  url = "https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000338",
  doi = "https://doi.org/10.1099/mgen.0.000338",
  keywords = "pangenome",
  keywords = "gene association networks",
  keywords = "gene co-occurrence",
  abstract = "The accessory genes of prokaryote and eukaryote pangenomes accumulate by horizontal gene transfer, differential gene loss, and the effects of selection and drift. We have developed Coinfinder, a software program that assesses whether sets of homologous genes (gene families) in pangenomes associate or dissociate with each other (i.e. are ‘coincident’) more often than would be expected by chance. Coinfinder employs a user-supplied phylogenetic tree in order to assess the lineage-dependence (i.e. the phylogenetic distribution) of each accessory gene, allowing Coinfinder to focus on coincident gene pairs whose joint presence is not simply because they happened to appear in the same clade, but rather that they tend to appear together more often than expected across the phylogeny. Coinfinder is implemented in C++, Python3 and R and is freely available under the GNU license from https://github.com/fwhelan/coinfinder."
}

What if I find a bug or have an issue running coinfinder?

If you run into any issues with coinfinder, we want to hear about it! Please don't be shy, and log an Issue including as much of the following as possible:

The exact command that you used to call coinfinder (helps us identify where in the code the bug might be).
A reproducible example of the issue with a small dataset that you can share (helps us identify whether the issue is specific to a particular computer, operating system, and/or dataset).
