spikChIP is a standalone pipeline designed in Perl to perform the normalization of multiple ChIP-seq experiments with spike-in. spikChIP adopts a wide range of available normalization strategies (raw, traditional, ChIP-Rx, tag removal). In addition, it implements a new approach based in local regression that is able to focus the impact of the normalization over the ChIP signal enriched regions (previously identified by external peak callers), diminishing secondary effects of the normalization over the background.
spikChIP is a command line that runs in Linux and Mac OS-X environments:
./spikChIP -vhcde <FLY|HUMAN|MOUSE|EXO> -k <bin_size_kbp> -p <0|1|2|3> <configuration_file> <chrominfo_file>
Users can configure the behavior of the program with the following options:
-c: remove intermediate files to reduce the size of the output folder
-d: allow the process of BAM files of < 1 Million reads
-e: exogenous genome identifier (default: FLY, allowed values: FLY,HUMAN,MOUSE,EXO)
-k: number of bp of the bins for the segmentation of the genome
-p: palette of colors for boxplots (0: black and white, 1: reds, 2: greens, 3: blues)
-v: verbose (show messages about the processing of the samples)
-h: short help
The configuration file of spikChIP is a plain-text file in which each line contains the information about the files of a particular experimental condition:
#sample bam_sample bam_spike peaks_sample peaks_spike
condition1 map_files/1_sample.bam map_files/1_spike.bam peak_files/1_sample_peaks.bed peak_files/1_spike_peaks.bed
condition2 map_files/2_sample.bam map_files/2_spike.bam peak_files/2_sample_peaks.bed peak_files/2_spike_peaks.bed
...
BAM files of each sample/spike genome (e.g. human reads and fruit fly reads) must be generated with a mapping tool prior to running spikChIP by the user. Peak files in BED format must be also generated with a peak calling tool before too. We employed Bowtie for mapping (multi-locus reads discarded, http://bowtie-bio.sourceforge.net) and MACS2 for peak calling (broad peaks, https://pypi.org/project/MACS2).
The ChromInfo file is a plain-text file with the list of chromosomes from the sample and spike-in genomes. This is an example for such a genome type made of human (hg19) and fruit fly (dm3):
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chr8 146364022
chr9 141213431
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr19 59128983
chr20 63025520
chr21 48129895
chr22 51304566
chrX 155270560
chrY 59373566
chr2L_FLY 23011544
chr2R_FLY 21146708
chr3L_FLY 24543557
chr3R_FLY 27905053
chr4_FLY 1351857
chrX_FLY 22422827
Please, notice that we use the "FLY" suffix to denote the chromosomes of the spike-in species. Moreover, it is mandatory that such chromosome names are identically written to describe the location of the reads in the BAM files of the sample and spike-in experiments for all the conditions as formally indicated in the configuration file.
The following programs must be installed in the computer (PATH variable):
http://ldicrocelab.crg.eu/index.php https://github.com/eblancoga/seqcode
https://www.r-project.org/ The affy and MASS libraries must be installed
The choices in the names of the folders are optional. Users can decide which folders to use when running spikChIP in the command line.
config/: It contains the configuration files employed for the examples shown in the article.
map_files/: Please, notice that we do not provide the BAM files of the examples shown in the article to save space. However, raw data of each example from NCBI GEO is appropriately referenced elsewhere to perform the mapping apart. Users should deposit the BAM files of sample/spike genomes within this folder.
peak files/: It contains the BED files employed for the examples shown in the article. Users should deposit the BED files of peaks of their particular examples.
spikChIP will create/recycle the following folders in the working directory.
(1) Genome-wide list of bins in the human genome (sample) and in the fruit fly (spike-in) with the values normalized using the SPIKCHIP (RAW, TRADITIONAL, CHIPRX and TAG REMOVAL are other normalization approaches also calculated by the program):
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_spike.txt.gz
(2) List of non-null bins belonging to peaks or background regions in samples/spike-ins:
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_sample_avg_bg.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_sample_avg_peaks.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_spike_avg_bg.txt.gz
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_spike_avg_peaks.txt.gz
Boxplots automatically generated with and without labels in PDF to show the distribution of corrected values by each normalization strategy calculated by the spikChIP script for the current set of ChIP-seq experiments.
Scripts written in R to perform the local regression and the final boxplots.
The computational pipeline of spikChIP consists of the following stages:
(Step 0) Read the options, load the configuration file and check the existence of BAM and BED files.
(Step 1) Segmentation into bins of the same size of the "sample" genome (e.g. human) and the "spike-in" genome (e.g. fruit fly) according to the list of chromosomes as declared in the ChromInfo.txt file.
(Step 2) Perform the counting of reads inside the bins (average and maximum values) and normalize these initial values with multiple approaches (e.g. raw, traditional, ChIP-Rx, tag removal and spikChIP).
(Step 3) Classification of the bins of both genomes into peaks or background regions according to the peaks provided by the user for each experiment as declared in the configuration file.
(Step 4) Generation of the final boxplots to show the distribution of normalized values for each normalization strategy and class of region (peaks and background) using average or maximum values.
(Step 5) Cleaning temporary files and compressing resulting final files of all normalization strategies.
Users can generate custom tracks in BedGraph format from the final files of genome-wide bins. Below, we show how to generate the human genome track from the spikChIP results:
FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz
chr1*1*1001 0.0997764035926034 0.100224097481314
chr1*100000001*100001001 0.11228442208177 0.107762054394226
chr1*10000001*10001001 0.191068011376152 0.15125504155222
...
zcat results/FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz | sed 's/\*/ /g' | gawk 'BEGIN{print "track type=bedGraph name=ALL_AVG_SPIKCHIP_25 description=ALL_AVG_SPIKCHIP_25 visibility=full maxHeightPixels=60 color=200,0,0";OFS="\t"}{print $1,$2,$3,$4}' > UCSCtracks/SPIKCHIP_25.bg; gzip UCSCtracks/SPIKCHIP_25.bg;
track type=bedGraph name=ALL_AVG_SPIKCHIP_25 description=ALL_AVG_SPIKCHIP_25 visibility=full maxHeightPixels=60 color=200,0,0
chr1 1 1001 0.0997764035926034
chr1 100000001 100001001 0.11228442208177
chr1 10000001 10001001 0.191068011376152
...
zcat results/FINAL_H3K79me2_25-75-75-25_SPIKCHIP_1000_avg_normalized_sample.txt.gz | sed 's/\*/ /g' | gawk 'BEGIN{print "track type=bedGraph name=ALL_AVG_SPIKCHIP_75 description=ALL_AVG_SPIKCHIP_75 visibility=full maxHeightPixels=60 color=255,100,100";OFS="\t"}{print $1,$2,$3,$5}' > UCSCtracks/SPIKCHIP_75.bg; gzip UCSCtracks/SPIKCHIP_75.bg;
track type=bedGraph name=ALL_AVG_SPIKCHIP_75 description=ALL_AVG_SPIKCHIP_75 visibility=full maxHeightPixels=60 color=255,100,100
chr1 1 1001 0.100224097481314
chr1 100000001 100001001 0.107762054394226
chr1 10000001 10001001 0.15125504155222
...
Those are the the final files of human genome-wide bins for the other classes of normalizations performed here:
results/FINAL_H3K79me2_25-75-75-25_RAW_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_RAW_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TRADITIONAL_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TRADITIONAL_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_CHIPRX_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_CHIPRX_1000_sample_max.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TAGREMOVAL_1000_sample_avg.txt.gz
results/FINAL_H3K79me2_25-75-75-25_TAGREMOVAL_1000_sample_max.txt.gz
We have created a UCSC session with all tracks generated by spikChIP using all normalization strategies:
https://genome.ucsc.edu/s/DiCroceLab/spikChIP_NAR%2DGB_2021
We have used multiple ChIP-seq datasets with spike-in from three different publications:
Orlando DA, Chen MW, Brown VE, Solanki S et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep 2014 Nov 6;9(3):1163-70. PMID: 25437568
https://www.ncbi.nlm.nih.gov/pubmed/25437568
[Jurkat_K79_25%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465005
[Jurkat_K79_75%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465007
./spikChIP.pl -vcp 1 config/H3K79me2_25-75_config.txt config/ChromInfo.txt 2> logs/H3K79me2_25-75.log;
Orlando DA, Chen MW, Brown VE, Solanki S et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep 2014 Nov 6;9(3):1163-70. PMID: 25437568
https://www.ncbi.nlm.nih.gov/pubmed/25437568
[Jurkat_K79_0%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465004
[Jurkat_K79_25%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465005
[Jurkat_K79_50%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465006
[Jurkat_K79_75%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465007
[Jurkat_K79_100%_R1]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1465008
./spikChIP.pl -vcp 1 config/H3K79me2_0-25-50-75-100_config.txt config/ChromInfo.txt 2> logs/H3K79me2_0-25-50-75-100.log;
Egan B, Yuan C, Craske ML, Labhart P et al. An Alternative Approach to ChIP-Seq Normalization Enables Detection of Genome-Wide Changes in Histone H3 Lysine 27 Trimethylation upon EZH2 Inhibition. PLoS One. 2016 Nov 22;11(11):e0166438.
https://pubmed.ncbi.nlm.nih.gov/27875550
[PC9_control_H3K27me3_Dmspike]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1890165
[PC9_EZH2inh_H3K27me3_Dmspike]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1890166
./spikChIP.pl -vcp 3 config/H3K27me3_control-EZH2inh_config.txt config/ChromInfo.txt 2> logs/H3K27me3_control-EZH2inh.log;
Guertin MJ, Cullen AE, Markowetz F, Holding AN. Parallel factor ChIP provides essential internal control for quantitative differential ChIP-seq. Nucleic Acids Res 2018 Jul 6;46(12):e75.
https://www.ncbi.nlm.nih.gov/pubmed/29672735
[SLX-8047_1b_ER_none]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747692
[SLX-8047_1a_ER_Fulvestrant]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747691
[SLX-8047_2b_ER_none]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747694
[SLX-8047_2a_ER_Fulvestrant]
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2747693
./spikChIP.pl -vcp 0 config/ER_R1R2_config.txt config/ChromInfo.txt 2> logs/ER_R1R2.log;
Users can find in the logs/ folder the whole list of messages provided by spikChIP for processing each dataset.
Quantitative comparison of multiple chromatin immunoprecipitation-sequencing (ChIP-seq) experiments with spikChIP. E. Blanco, C. Ballaré, L. Di Croce, and S. Aranda. Methods in Molecular Biology (2023) 2624: 55-72.
https://pubmed.ncbi.nlm.nih.gov/36723809/
SpikChIP: a novel computational methodology to compare multiple ChIP-seq using spike-in chromatin. Enrique Blanco, Luciano Di Croce, Sergi Aranda. NAR Genom Bioinform. (2021) Jul 27;3(3):lqab064.
https://pubmed.ncbi.nlm.nih.gov/34327329/
Comparative ChIP-seq (Comp-ChIP-seq): a novel computational methodology for genome-wide analysis. Enrique Blanco, Luciano Di Croce, Sergi Aranda. bioRxiv 532622 (2019).