CollasLab / edd

Enriched Domain Detector for ChIP-seq data
https://pypi.python.org/pypi/edd
MIT License
16 stars 4 forks source link

EDD - Enriched Domain Detector

EDD is a ChIP-seq peak caller for detection of megabase domains of enrichment.

Installation Requirements

At the moment, EDD only supports Python 2.7. The most CPU intensive parts of the algorithm have been written in Cython, which will be translated into C and compiled during installation. A consequence of this is that EDD has compile time dependencies on the following python packages: Cython, numpy and pysam. So these packages must be installed prior to installing EDD.

pip install Cython numpy pysam=0.11

Currently, as of May 2018, the most recent pysam versions are incompatible with EDD. Multiple other packages depends upon pysam. It is therefore recommended to install EDD in a python or conda virtual environment. Another posibillity is to create a docker or singularity container with EDD.

EDD currently uses the pybedtools library that again requires the bedtools program to be installed and on the path. See the bedtools documentation for installation instructions

Installation

The latest stable version of EDD can be installed using pip

pip install edd

The latest development version of EDD can be installed by running the following command from the source folder after downloading or checking out a new version:

python setup.py install

Usage

All the required and optional arguments to EDD are listed here and further explained below

edd [-h] [--bin-size BIN_SIZE] [-n NUM_TRIALS] [-p NPROCS] [--fdr FDR] [-g GAP_PENALTY] [--config-file CONFIG_FILE] [--write-log-ratios] [--write-bin-scores] [-v]

Required Arguments

Optional Arguments

Output Files

EDD creates three files in the output directory.

The peaks should always be compared against the bedgraph file in a genome browser. See the Selecting a negative score scale parameter example in the Additional section below.

Configuring EDD

In addition to required and optional run time parameters, EDD has some parameters where we think the default values will be preferable for most use cases. However, situations might arise where additional fine tuning might be required and we see no need to hide these parameters within the source code. The parameters in the file eddlib/default_parameters.conf is read at startup unless another configuration file is given with the optional run time argument --config-file. To change these values, either edit the configuration file in the EDD source code directly, or make and edit a copy and supply the path to this copy with the -config-file argument.

Additional

Input Files

The ip and input bam files are expected to be of the approximate same depth. EDD will perform a basic normalization by scaling input reads by a factor. This will not reflect biology if the difference between IP and input read depth is too large. It is therefore advisable to downsample the experiment with the higher read count instead of scaling up the lesser experiment by a factor. It is up to the researcher to decide when to downsample instead of letting EDD perform this simple normalization.

Obtain chromosome sizes

This can be extracted in various ways. For hg19, it is as simple as this (example borrowed from bedtools):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome

Acquire unalignable regions

The unalignable regions file is a bed file that identifies regions that should be excluded from the analysis. More precisely, no peak can ever span an unalignable region. Typical candidates for exclusion are broad spanning repeat regions such as centromeres and telomeres. Failure to include a proper unalignable regions file for an analysis will increase the number of false positives among the detected peaks. An example unalignable regions file for hg19 can be found here.

This has been downloaded from the UCSC table browser using these options.

Understanding the gap penalty parameter

The gap penalty parameter decides how strongly EDD penalizes non-enriched bins (NIBs) within putative domains. The effect of this parameter is visualized below.

example picture illustrating how the gap penalty parameter affects the peaks found

This example displays an interesting region of a dataset analyzed with three different settings for the gap penalty parameter. The top track shows the bin scores, against which we visually evaluate the peak tracks. We first note that the detection of many domains is unaffected by changes to the gap penalty (on both flanks). However, a larger domain in the middle illustrates how domain detection is influenced by this parameter.

We found the track with the middle gap penalty (5) to best represent the source data. It is, however, important to understand that none of the other two tracks are wrong or illegal. The best choice depends on the user's interests and goals for further analysis.

It is also important to understand that what is a high and low gap penalty depends on the source data analyzed. A rule of thumb is that "high quality" ChIP-seq experiments ("quality" is influenced by e.g. the protein ChIPed, the antibody used, sequencing depth) can tolerate a higher gap penalty value due to less noise and therefore less heterogeneity.

Auto-estimating the gap penalty parameter

Even though auto-estimation has worked well for the data sets we have analyzed to date, resulting peaks should nevertheless be checked in a genome browser along with the bin score file that is created from each analysis. If the estimated parameter appears non optimal (see Understanding the gap penalty parameter above), then we recommend re-running the analysis with the gap penalty parameter set manually. The log file will indicate what the estimated gap penalty was, which can be used as a starting point.

Tested Operating Systems