GarrettJenkinson / informME

An information-theoretic pipeline for methylation analysis of WGBS data
GNU General Public License v3.0
25 stars 8 forks source link
bash bioinformatics cpp information-theory ising-model matlab methylation next-generation-sequencing parallel-computing parameter-estimation r sge slurm

informME

An information-theoretic pipeline for methylation analysis of WGBS data

LATEST RELEASE: v0.3.3


This directory contains the MATLAB, C++ and R software that implements informME, as well as the BASH wrappers and example submission scripts used to run informME on a Sun Grid Engine (SGE) cluster or a SLURM computer cluster.

informME is an information-theoretic approach to methylation analysis developed in [1] and [2], using the Ising model of statistical physics. This method is applied on BAM files with reads aligned to a reference genome, which are generated during a whole-genome bisulphite sequencing (WGBS) experiment, and produces genome-wide information about the statistical properties of methylation in a single-sample and in a differential analysis framework. The resulting information is stored in multiple MATLAB MAT files for subsequent processing and is summarized by bedGraph genomic tracks that can be visualized using a genome browser (such as the UCSC genome browser, see https://genome.ucsc.edu).

The current implementation of informME has been tested within the following environments:

and by using the following tools:

A. DIRECTORY STRUCTURE

informME includes the following directories:

  1. informME/bin - contains softlinks to all executable bash scripts

  2. informME/src - contains all bash wrappers as well as the MATLAB, C++, and R code used by informME

  3. informME/third_party - contains third party MATLAB software used by informME

  4. informME/cluster - contains templates for running informME on SGE and SLURM clusters

B. DEPENDENCIES

It is required to properly configure a MATLAB C++ mex compiler. For details, see:

A working SAMtools installation needs to be on the system path as well. The remaining dependencies (GMP, MPFR, MPREAL, and Eigen) are all C++ dependencies that the installation script will install as needed.

NOTE: The required "mex compiler" is different from the "Matlab compiler SDK". InformME requires the former and not the latter.

C. INSTALLING InformME

Run install.sh in the informME directory. During the interactive installation process the user will be asked different questions regarding the locations of default directories. Dependencies, such as GMP, MPFR, MPREAL, and Eigen, will be automatically installed during this process if these are not already available, and the C++ MEX code of informME will be compiled. The optional postprocessing utilities may have their own dependencies as noted in their corresponding documentation below.

Note 1: the following environment variables will optionally be defined in a configuration file stored in ~/.informME/informME.config and then will be accessed throughout multiple points in the informME pipeline:

Note 2: The previous environment variables can be overwritten through optional arguments when running any part of informME. If these variables are not defined, then the user must pass the corresponding paths as optional arguments.

D. RUNNING informME

Run the informME software using the following steps in the indicated order (type the name of the command with no arguments to see the help file with instructions in each case).

D.1. REFERENCE GENOME ANALYSIS:

    fastaToCpg.sh [OPTIONS] FASTA_FILE

This step analyzes the reference genome FASTA_FILE (in FASTA format) and produces a MATLAB MAT file CpGlocationChr#.mat for each chromosome, which is stored by default in REFGENEDIR, and contains the following information:

NOTE1: This step only needs to be completed one time for a given reference genome. Start analyzing samples at step D.2 if you have previously completed step D.1 for your sample's reference genome.

NOTE2: At this time the statistical model of informME has been designed to work only with autosomes, and so the informME software will not model mitochondrial chromosomes, lambda spike-ins, partial contigs, sex chromosomes, et cetera. Also the reference fasta file to which bam files have been aligned is assumed to be sorted so that the somatic chromosomes come first and in the usual order: chr1,chr2,...,chrN.

D.2. METHYLATION DATA MATRIX GENERATION:

    getMatrices.sh [OPTIONS] BAM_FILE CHR_NUM

This step takes the BAM file BAM_FILE as input and generates the methylation data matrix for chromosome number CHR_NUM. By default, the file BAM_FILE and its associated index file (with extension .bai) is expected to be in BAMDIR, and the output file produced by this step is stored in a subdirectory in INTERDIR named after the chromosome number CHR_NUM. The output file preserves the prefix from the file BAM_FILE and the suffix '_matrices.mat' is appended to it (e.g. if BAM_FILE is normal_sample.bam and CHR_NUM is 10, then the output file is saved as INTERDIR/chr10/normal_sample_matrices.mat). The file produced contains the following information for each genomic region, which is subsequently used for model estimation:

NOTE1: We recommend taking advantage of the array feature available in SGE and SLURM based clusters to submit an individual job for each chromosome.

NOTE2: See reference [1], "Online Methods: Quality control and alignment" for our suggested preprocessing steps when generating a sorted, indexed, deduplicated BAM file to input to informME.

D.3. MODEL ESTIMATION & ANALYSIS:

    informME_run.sh [OPTIONS] MAT_FILES PHENO CHR_NUM

This step is comprised of two phases. During the first phase, informME learns the parameters of the Ising probability distribution by combining the methylation data matrices provided through the argument MAT_FILES (comma-separated list) for chromosome number CHR_NUM. By default, the MAT_FILES are expected to be in a subdirectory named after CHR_NUM in INTERDIR. The output generated during this phase is also stored in a subdirectory in INTERDIR named after chromosome number CHR_NUM. The output file has as prefix PHENO and the suffix '_fit.mat' appended to it (e.g. if 'normal' is the PHENO, and CHR_NUM is 10, then the output is stored as INTERDIR/chr10/normal_fit.mat). The file produced contains the following information:

The second phase of this step consists in analyzing the model learned by computing a number of statistical summaries of the methylation state, including probability distributions of methylation levels, mean methylation levels, and normalized methylation entropies, as well as mean and entropy based classifications. This step also computes entropic sensitivity indices, methylation sensitivity indices, as well information-theoretic quantities associated with methylation channels, such as turnover ratios, channel capacities, and relative dissipated energies. The output generated during this phase is stored in the same directory as the output generated during the first phase, using the same prefix as before. However, the suffix is now '_analysis.mat' (e.g. following the previous example, the output file of this phase is stored as INTERDIR/chr10/normal_analysis.mat). This file contains the following information:

NOTE: We recommend taking advantage of the array feature available in SGE and SLURM based clusters to submit an individual job for each chromosome.

D.4. GENERATE BED FILES FOR SINGLE ANALYSIS:

    singleMethAnalysisToBed.sh [OPTIONS] PHENO

This function makes BED files from the methylation analysis results obtained after running informME_run.sh for a given phenotype PHENO. By default, the input file (analysis file) is expected to be located in INTERDIR/chr#/PHENO_analysis.mat. In addition, the output files are stored in FINALDIR and have the following names and content:

D.5. GENERATE BED FILES FOR DIFFERENTIAL ANALYSIS:

diffMethAnalysisToBed.sh [OPTIONS] PHENO1 PHENO2

This function makes BED files for the differential methylation analysis results obtained after running informME_run.sh for two given phenotypes PHENO1 and PHENO2. By default, the input files (both analysis files) are expected to be located in INTERDIR/chr#/PHENO1_analysis.mat and INTERDIR/chr#/PHENO2_analysis.mat respectively. In addition, the output files are stored in FINALDIR and have the following names and content:

D.6. POST-PROCESSING

D.6.1. BED TO BW CONVERSION

The user can employ a provided utility to convert BED files generated by informME to much smaller BigWig (BW) files. This can be done by running the command:

./bed2bw.sh "path" "asy"

where

NOTE 1: Type bed2bw.sh to get more information about this utility

NOTE 2: For this utility, the following tools must be installed in $PATH: bedtools, bedClip, bedGraphToBigWig, and fetchChromSizes

D.6.2. DMR DETECTION

The user can use a provided utility to perform DMR detection using the Jensen-Shannon distance (JSD) based on the method described in [2]. This utility must be run within an R session.

usage (when replicate reference data is available):

setwd("/path/to/informME/src/R_src/")
source("jsDMR.R") 
runReplicateDMR(refVrefFiles,testVrefFiles,inFolder,outFolder)

where

usage (when no replicate reference data is available)

source("/path/to/informME/src/R_src/jsDMR.R") 
runNoReplicateDMR(JSDfile,inFolder,outFolder)

where

NOTE 1: More information about these utilities can be found in informME/src/R_src/postprocess/README.txt

NOTE 2: For this utilities, the following tools must be installed in R: rtracklayer, logitnorm, mixtools, annotatr, and Homo.sapiens

D.6.3. GENE RANKING

The user can use a provided utility to rank all Human genes in the Bioconductor library TxDb.Hsapiens.UCSC.hg19.knownGene using the average mutual information based on the method described in [3]. This utility must be run within an R session.

usage (when replicate reference data is available):

source("/path/to/informME/src/R_src/jsGrank.R")
rankGenes(refVrefFiles,testVrefFiles,inFolder,outFolder,tName,rName)

where

In this case, the function generates the file gRank-JSD-tName-VS-rName.tsv.

usage (when no replicate reference data is available and rankings will be done by average promoter JSD):

setwd("path/to/informME/src/R_src/")
source("jsGrank.R")
rankGenes(c(),testVrefFiles,inFolder,outFolder,tName,rName)

where

In this case, the function generates the file gRankProms-JSD-tName-VS-rName.tsv.

NOTE 1: More information about this utility can be found in informME/src/R_src/postprocess/README.txt

NOTE 2: For this utility, the following tools must be installed in R: GenomicFeatures, GenomicRanges, rtracklayer, TxDb.Hsapiens.UCSC.hg19.knownGene, gamlss, Homo.sapiens

D.6.4. REGION RANKING

The user can use a provided utility to rank all regions in a BED file using the average mutual information based on the method described in [3]. This utility must be run within an R session.

usage (replicate reference data is required):

source("/path/to/informME/src/R_src/jsGrank.R")
rankRegions(refVrefFiles,testVrefFiles,regionsFile,regionsName,inFolder,outFolder,tName,rName)

where

In this case, the function generates the file regionRankings-tName-VS-rName.tsv.

NOTE 1: More information about this utility can be found in informME/src/R_src/postprocess/README.txt

NOTE 2: For this utility, the following tools must be installed in R: GenomicFeatures, GenomicRanges, rtracklayer, TxDb.Hsapiens.UCSC.hg19.knownGene, gamlss, Homo.sapiens

REFERENCES

[1] Jenkinson, G., Pujadas, E., Goutsias, J., and Feinberg, A.P. (2017), Potential energy landscapes identify the information-theoretic nature of the epigenome, Nature Genetics, 49: 719-729.

[2] Jenkinson, G., Abante, J., Feinberg, A.P., and Goutsias, J. (2018), An information-theoretic approach to the modeling and analysis of whole-genome bisulfite sequencing data, BMC Bioinformatics, 19:87, https://doi.org/10.1186/s12859-018-2086-5.

[3] Jenkinson, G., Abante, J., Koldobskiy, M., Feinberg, A.P., and Goutsias, J. (2019), Ranking genomic features using an information-theoretic measure of epigenetic discordance, BMC Bioinformatics, 20:175, https://doi.org/10.1186/s12859-019-2777-6.

WIKI

See our wiki for additional examples and a more readable introduction to informME.

VERSION HISTORY

v0.3.3 - Added ranking capabilities for promoters, gene bodies, and arbitrary genomic intervals, based on mutual information (see [3] for details).

v0.3.2 - Added methylation sensitivity index (MSI) to output. Improved checking of user inputs to catch errors early. Tabulate output bedGraph files. Documentation improvements.

v0.3.1 - More informative error messages. Fix UI to expose ability to model single-ended sequencing reads. Documentation improvements.

v0.3.0 - Code reorganized into a source and bin directory structure. Wrapper shell scripts added to all user callable functions. I/O directories no longer inside code directories. General UI improvements for easier usage.

v0.2.2 - Minor updates in gene ranking and small updates getting ready for the next major reorganization release.

v0.2.1 - Added R utilities for DMR detection and gene ranking using the Jensen-Shannon distance (JSD). Various documentation improvements.

v0.2.0 - Code reorganized into more specialized directories, streamlined, and general SGE submission scripts provided as a guide. Updated README's. Variable names changed to reflect published notation.

v0.1.0 - Initial release. Code widely tested internally. Code used to create results for ref [1].

LICENCING

All code authored by Garrett Jenkinson or Jordi Abante in informME is licensed under a GPLv3 license; exceptions to GPL licensing are the files contained in the following directories:

These files have their own licensing information in their headers. Thanks to Arnold Neumaier and Ali Mohammad-Djafari for their permissions to modify and distribute their software with informME.