eblancoga / spikChIP

spikChIP is a computational pipeline to perform multiple alternative normalization corrections on ChIP-seq data with spike-in (Raw, Traditional, ChIP-Rx, Tag removal and spikChIP)
GNU General Public License v3.0
3 stars 1 forks source link

SpikChIP

A novel computational methodology to compare multiple ChIP-seq using spike-in chromatin

Enrique Blanco, Luciano Di Croce, Sergi Aranda

What is spikChIP?

spikChIP is a standalone pipeline designed in Perl to perform the normalization of multiple ChIP-seq experiments with spike-in. spikChIP adopts a wide range of available normalization strategies (raw, traditional, ChIP-Rx, tag removal). In addition, it implements a new approach based in local regression that is able to focus the impact of the normalization over the ChIP signal enriched regions (previously identified by external peak callers), diminishing secondary effects of the normalization over the background.

How to run spikChIP

spikChIP is a command line that runs in Linux and Mac OS-X environments:

./spikChIP -vhcde <FLY|HUMAN|MOUSE|EXO> -k <bin_size_kbp> -p <0|1|2|3> <configuration_file> <chrominfo_file>

Users can configure the behavior of the program with the following options:

-c: remove intermediate files to reduce the size of the output folder
-d: allow the process of BAM files of < 1 Million reads
-e: exogenous genome identifier (default: FLY, allowed values: FLY,HUMAN,MOUSE,EXO)
-k: number of bp of the bins for the segmentation of the genome
-p: palette of colors for boxplots (0: black and white, 1: reds, 2: greens, 3: blues)
-v: verbose (show messages about the processing of the samples)
-h: short help

The configuration file of spikChIP is a plain-text file in which each line contains the information about the files of a particular experimental condition:

#sample  bam_sample bam_spike   peaks_sample    peaks_spike
condition1  map_files/1_sample.bam  map_files/1_spike.bam   peak_files/1_sample_peaks.bed   peak_files/1_spike_peaks.bed
condition2  map_files/2_sample.bam  map_files/2_spike.bam   peak_files/2_sample_peaks.bed   peak_files/2_spike_peaks.bed
...

BAM files of each sample/spike genome (e.g. human reads and fruit fly reads) must be generated with a mapping tool prior to running spikChIP by the user. Peak files in BED format must be also generated with a peak calling tool before too. We employed Bowtie for mapping (multi-locus reads discarded, http://bowtie-bio.sourceforge.net) and MACS2 for peak calling (broad peaks, https://pypi.org/project/MACS2).

The ChromInfo file is a plain-text file with the list of chromosomes from the sample and spike-in genomes. This is an example for such a genome type made of human (hg19) and fruit fly (dm3):

chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr19   59128983
chr20   63025520
chr21   48129895
chr22   51304566
chrX    155270560
chrY    59373566
chr2L_FLY   23011544
chr2R_FLY   21146708
chr3L_FLY   24543557
chr3R_FLY   27905053
chr4_FLY    1351857
chrX_FLY    22422827

Please, notice that we use the "FLY" suffix to denote the chromosomes of the spike-in species. Moreover, it is mandatory that such chromosome names are identically written to describe the location of the reads in the BAM files of the sample and spike-in experiments for all the conditions as formally indicated in the configuration file.

Software requirements

The following programs must be installed in the computer (PATH variable):

http://ldicrocelab.crg.eu/index.php https://github.com/eblancoga/seqcode

http://www.htslib.org/

http://www.gnu.org

https://www.r-project.org/ The affy and MASS libraries must be installed

Input folders and files

The choices in the names of the folders are optional. Users can decide which folders to use when running spikChIP in the command line.

Output folders and files

spikChIP will create/recycle the following folders in the working directory.

(1) Genome-wide list of bins in the human genome (sample) and in the fruit fly (spike-in) with the values normalized using the SPIKCHIP (RAW, TRADITIONAL, CHIPRX and TAG REMOVAL are other normalization approaches also calculated by the program):

FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_spike.txt.gz

(2) List of non-null bins belonging to peaks or background regions in samples/spike-ins:

FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_sample_avg_bg.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_sample_avg_peaks.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_spike_avg_bg.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_spike_avg_peaks.txt.gz

Boxplots automatically generated with and without labels in PDF to show the distribution of corrected values by each normalization strategy calculated by the spikChIP script for the current set of ChIP-seq experiments.

Scripts written in R to perform the local regression and the final boxplots.

Main stages of the analysis

The computational pipeline of spikChIP consists of the following stages:

(Step 0) Read the options, load the configuration file and check the existence of BAM and BED files.

(Step 1) Segmentation into bins of the same size of the "sample" genome (e.g. human) and the "spike-in" genome (e.g. fruit fly) according to the list of chromosomes as declared in the ChromInfo.txt file.

(Step 2) Perform the counting of reads inside the bins (average and maximum values) and normalize these initial values with multiple approaches (e.g. raw, traditional, ChIP-Rx, tag removal and spikChIP).

(Step 3) Classification of the bins of both genomes into peaks or background regions according to the peaks provided by the user for each experiment as declared in the configuration file.

(Step 4) Generation of the final boxplots to show the distribution of normalized values for each normalization strategy and class of region (peaks and background) using average or maximum values.

(Step 5) Cleaning temporary files and compressing resulting final files of all normalization strategies.

How to generate Custom tracks from spikChIP final results to be visualized in the UCSC browser:

Users can generate custom tracks in BedGraph format from the final files of genome-wide bins. Below, we show how to generate the human genome track from the spikChIP results:

FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz

chr1*1*1001 0.0997764035926034  0.100224097481314
chr1*100000001*100001001    0.11228442208177    0.107762054394226
chr1*10000001*10001001  0.191068011376152   0.15125504155222
...

zcat results/FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz | sed 's/\*/ /g' | gawk 'BEGIN{print "track type=bedGraph name=ALL_AVG_SPIKCHIP_25 description=ALL_AVG_SPIKCHIP_25 visibility=full maxHeightPixels=60 color=200,0,0";OFS="\t"}{print $1,$2,$3,$4}' > UCSCtracks/SPIKCHIP_25.bg; gzip UCSCtracks/SPIKCHIP_25.bg;

track type=bedGraph name=ALL_AVG_SPIKCHIP_25 description=ALL_AVG_SPIKCHIP_25 visibility=full maxHeightPixels=60 color=200,0,0
chr1    1   1001    0.0997764035926034
chr1    100000001   100001001   0.11228442208177
chr1    10000001    10001001    0.191068011376152
...

zcat results/FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz | sed 's/\*/ /g' | gawk 'BEGIN{print "track type=bedGraph name=ALL_AVG_SPIKCHIP_75 description=ALL_AVG_SPIKCHIP_75 visibility=full maxHeightPixels=60 color=255,100,100";OFS="\t"}{print $1,$2,$3,$5}' > UCSCtracks/SPIKCHIP_75.bg; gzip UCSCtracks/SPIKCHIP_75.bg;

track type=bedGraph name=ALL_AVG_SPIKCHIP_75 description=ALL_AVG_SPIKCHIP_75 visibility=full maxHeightPixels=60 color=255,100,100
chr1    1   1001    0.100224097481314
chr1    100000001   100001001   0.107762054394226
chr1    10000001    10001001    0.15125504155222
...

Those are the the final files of human genome-wide bins for the other classes of normalizations performed here:

results/FINAL_H3K79me2_25-75-75-25_RAW_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_RAW_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TRADITIONAL_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TRADITIONAL_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_CHIPRX_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_CHIPRX_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TAGREMOVAL_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TAGREMOVAL_1000_sample_max.txt.gz

We have created a UCSC session with all tracks generated by spikChIP using all normalization strategies:

https://genome.ucsc.edu/s/DiCroceLab/spikChIP_NAR%2DGB_2021

Benchmarking datasets

We have used multiple ChIP-seq datasets with spike-in from three different publications:

Orlando DA, Chen MW, Brown VE, Solanki S et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep 2014 Nov 6;9(3):1163-70. PMID: 25437568

https://www.ncbi.nlm.nih.gov/pubmed/25437568

[Jurkat_K79_25%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465005

[Jurkat_K79_75%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465007

./spikChIP.pl -vcp 1 config/H3K79me2_25-75_config.txt config/ChromInfo.txt 2> logs/H3K79me2_25-75.log;

Orlando DA, Chen MW, Brown VE, Solanki S et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep 2014 Nov 6;9(3):1163-70. PMID: 25437568

https://www.ncbi.nlm.nih.gov/pubmed/25437568

[Jurkat_K79_0%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465004

[Jurkat_K79_25%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465005

[Jurkat_K79_50%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465006

[Jurkat_K79_75%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465007

[Jurkat_K79_100%_R1]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465008

./spikChIP.pl -vcp 1 config/H3K79me2_0-25-50-75-100_config.txt config/ChromInfo.txt 2> logs/H3K79me2_0-25-50-75-100.log;

Egan B, Yuan C, Craske ML, Labhart P et al. An Alternative Approach to ChIP-Seq Normalization Enables Detection of Genome-Wide Changes in Histone H3 Lysine 27 Trimethylation upon EZH2 Inhibition. PLoS One. 2016 Nov 22;11(11):e0166438.

https://pubmed.ncbi.nlm.nih.gov/27875550

[PC9_control_H3K27me3_Dmspike]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1890165

[PC9_EZH2inh_H3K27me3_Dmspike]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1890166

./spikChIP.pl -vcp 3 config/H3K27me3_control-EZH2inh_config.txt config/ChromInfo.txt 2> logs/H3K27me3_control-EZH2inh.log;

Guertin MJ, Cullen AE, Markowetz F, Holding AN. Parallel factor ChIP provides essential internal control for quantitative differential ChIP-seq. Nucleic Acids Res 2018 Jul 6;46(12):e75.

https://www.ncbi.nlm.nih.gov/pubmed/29672735

[SLX-8047_1b_ER_none]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747692

[SLX-8047_1a_ER_Fulvestrant]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747691

[SLX-8047_2b_ER_none]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747694

[SLX-8047_2a_ER_Fulvestrant]

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747693

./spikChIP.pl -vcp 0 config/ER_R1R2_config.txt config/ChromInfo.txt 2> logs/ER_R1R2.log;

Examples of running messages of spikChIP for multiple datasets

Users can find in the logs/ folder the whole list of messages provided by spikChIP for processing each dataset.

References to cite spikChIP:

Quantitative comparison of multiple chromatin immunoprecipitation-sequencing (ChIP-seq) experiments with spikChIP. E. Blanco, C. Ballaré, L. Di Croce, and S. Aranda. Methods in Molecular Biology (2023) 2624: 55-72.

https://pubmed.ncbi.nlm.nih.gov/36723809/

SpikChIP: a novel computational methodology to compare multiple ChIP-seq using spike-in chromatin. Enrique Blanco, Luciano Di Croce, Sergi Aranda. NAR Genom Bioinform. (2021) Jul 27;3(3):lqab064.

https://pubmed.ncbi.nlm.nih.gov/34327329/

Comparative ChIP-seq (Comp-ChIP-seq): a novel computational methodology for genome-wide analysis. Enrique Blanco, Luciano Di Croce, Sergi Aranda. bioRxiv 532622 (2019).

https://www.biorxiv.org/content/10.1101/532622v2