#
#
############### Part 0: Setup #################### This folder "ChromDiff" contains the code and data for running ChromDiff on the examples provided in the manuscript.
Part 0.1: Dependencies for ChromDiff include BedTools, awk, bash, and R. Please make sure these are installed and working before trying to run ChromDiff.
Part 0.2a: Applying ChromDiff to Epigenomics Roadmap data, as described in Yen and Kellis, 2015: The only data that was not included in this zipped file is the 15-state ChromHMM state calls. The easiest way to download all these chromatin state calls is to download the compressed .tar.gz directory at http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/all.mnemonics.bedFiles.tgz. Then, uncompress the directory and move all the contained files into the ChromDiff subdirectory statecalls/core/, which has already been created within this directory. Alternatively, you can download all state call files of the form E***_15_coreMarksmnemonics.bed.gz from http://www.broadinstitute.org/~anshul/projects/roadmap/chromhmmSegmentations/ChmmModels/coreMarks/parallel/set2/final/ into the subdirectory statecalls/core/, which has already been created within this directory. Generally, we are using the "Mnemonics bed files" of the "Core 15-state model described here: http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state
Once the state calls have been downloaded an dmoved into the corresponding directory, simply run ./notes_v2.sh to reproduce the analysis, results, and figures. (NOTE: as described in Yen and Kellis, 2015, our gene list does not include chrY genes. Therefore, if you download our gene list in data/gencode_genes_full.txt, this will be a list of all Version 10 GENCODE protein-coding genes, except for ChrY genes.) All input files can be found in the data/ folder, which can be used as examples as described in Part 1 of this README file. --> To find results and figures, please see Part 4 of this README file.
Part 0.2b: Applying ChromDiff to other data:
########### PART 1: Formatting for other data for ChromDiff input ############# ChromDiff relies on the following information
See below for details:
That is, the first column is the unique ID, the 3rd column is the hex coding of the color for the epigenome, and the fourth column is the name (label that will be used on plots). The second column is the path to the chromatin state annotation file (as described in "3. Chromatin state annotations" below). The remaining columns are any other categories you will want to compare the data based on. The first row must be the "column names" for this matrix, with "ID", "filepath", and "name" as the first three column names, as shown above. NOTE: no periods (.) should be used in the metadata as epigenome IDs, category names, or category values. That means no periods should be used in 1st column, or in any columns after the 3rd column. (Periods may and will likely be used in the filepath column.) NOTE: every category name should be unique from every other category name. (i.e. the 5th and 6th column names can not be the same.) NOTE: Every option value should be uniquely assigned to one category. (For example, there can't be a "TRUE" for category "is_female" and also a "TRUE" entry for category "is_tissue". This can be avoided by using values such as "TRUEFEMALE" and "TRUETISSUE", or simply "FEMALE" and "TISSUE".)
Example: final_celltype_metadata.txt
chr1 34553 36081 - ENSG00000237613.2 FAM138A chr1 69090 70008 + ENSG00000186092.4 OR4F5 chr1 367639 368634 + ENSG00000235249.1 OR4F29
The 6th column of genesymbols are optional, but are necessary for gene set
enrichment calculations.
Example: genes/gencode_genes_full.txt
NOTE: the 4th column must be of the form STATENUM_STATEMNEMONIC, and these must match the numbers/mnemonics given in the chromatin state metadata file (Part1.3b).
b) Chromatin state metadata A tab-delimited file listing all possible chromatin states with the following information: STATE NO. MNEMONIC DESCRIPTION COLOR NAME COLOR CODE 1 TssA Active TSS Red "255,0,0" 2 TssAFlnk Flanking Active TSS Orange Red "255,69,0" 3 TxFlnk Transcr. at gene 5' and 3' LimeGreen "50,205,50" 4 Tx Strong transcription Green "0,128,0" 5 TxWk Weak transcription DarkGreen "0,100,0" 6 EnhG Genic enhancers GreenYellow "194,225,5" 7 Enh Enhancers Yellow "255,255,0" 8 ZNF/Rpts ZNF genes & repeats Medium Aquamarine "102,205,170" 9 Het Heterochromatin PaleTurquoise "138,145,208" 10 TssBiv Bivalent/Poised TSS IndianRed "205,92,92" 11 BivFlnk Flanking Bivalent TSS/Enh DarkSalmon "233,150,122" 12 EnhBiv Bivalent Enhancer DarkKhaki "189,183,107" 13 ReprPC Repressed PolyComb Silver "128,128,128" 14 ReprPCWk Weak Repressed PolyComb Gainsboro "192,192,192" 15 Quies Quiescent/Low White "255,255,255"
NOTE: State numbers must go from 1...n, where n is the total number of states.
Example: core_annotation.txt
Example: rnaseq/57epigenomes.RPKM.pc
5. a) Covariate matrix for factors that will be corrected for. Values can be factors (which will be automatically converted by R), or numbers for "mixed" covariates: UCSD BI UCSF-UBC UW FEMALE MALE UNKNOWN SOLID_LIQUID TYPE E017 1 0 0 0 1 0 0 NONE CELLLINE E002 0 1 0 0 1 0 0 NONE PRIMARYCULTURE E008 1 0 0 0 1 0 0 NONE PRIMARYCULTURE E001 0 1 0 0 1 0 0 NONE PRIMARYCULTURE E015 0 1 0 0 1 0 0 NONE PRIMARYCULTURE
Example: cov.mat.txt
b) Mapping metadata matrix to covariate matrix. This must explain which metadata columns (i.e. "category1", "category2", ...) correspond to which covariate columns. Matrix has n rows and m columns, where n is the number of the metadata columns (categories) that we may correct for, while m is the number of covariate columns. Each matrix cell is TRUE for elements matrix[i,j] where metadata column i corresponds to covariate column j, and FALSE otherwise: UCSD BI UCSF.UBC UW FEMALE MALE UNKNOWN SOLID_LIQUID TYPE LAB TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE SOLID_LIQUID FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TYPE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE SEX FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
Example: map_vars_covariates.txt
######### PART 2: Specifying input files and variables ################### ChromDiff requires that input files and variables be set in the file notes_v2.sh. The file currently contains an outline of the pipeline, with inputs currently set to the example files and parameters that generated the sex comparison (Female vs Male) described in Yen and Kellis, 2015. The description of the required inputs is described below, in Part2a-Part 2e.
These required inputs for ChromDiff can roughly be divided into two parts, as shown by "Part 1" and "Part 2" in the notes_v2.sh. Part 1, which is referenced in sections a-b below, has to do with processing the data and generating the ChromDiff representation. Part 2, which is referenced in sections c-e, has to do with the parameters for each epigenomic group comparison (such as Female/Male, Adult/Fetal, etc.).
a. Labels: In notes_v2.sh, you need to specify 2 names/labels. Specify statecalls_label to any name or label that describes your epigenomic (chromatin state call) information. Set generegions_label to any name or label that describes your gene regions. These labels are so that you can run the ChromDiff pipeline on different chromatin state calls or different epigenomic information, as well as different gene regions.
Example (as shown in "Part 1" of notes_v2.sh): statecalls_label="core" generegions_label="gencode_v10"
b. File inputs: In notes_v2.sh, you also need to specify the following file inputs. They must be formatted as described in the corresponding section of Part 1 above:
i. metadatafile: epigenomic metadata file as described in Part 1.1
ii. genefile: file of gene regions as described in Part 1.2
iii. state_annotations_file: metadata for chromatin states, as described in
Part 1.3b.
iv. expfile: expression matrix, as specified in Part 1.4.
v. covariate_mat_file: Covariate matrix specifying the covariates and
associated epigenomes, as described in Part 1.5a
vi. map_covariates_file: Matrix specifying relationships between
covariates (in the covariate matrix) and metadata categories/properties,
as specified in 1.5b
Example (as shown in "Part 1" of notes_v2.sh): metadatafile="data/final_celltype_metadata.txt" genefile="data/gencode_genes_full.txt" states_info="data/core_annotation.txt" expfile="data/57epigenomes.RPKM.pc" covariate_mat_file="data/cov.mat.txt" map_covariates_file="data/map_vars_covariates.txt"
c. Epigenomic groups: To compare groups of epigenomes to each other, you need to specify the property and options you want to compare Specifically, the property is the column name (category) in your epigenome metadata matrix (i.e. "sex"). The two options are the two (different) values for that column you want to compare your epigenomes based on (i.e. "Female" vs "Male"). Then, ChromDiff will compare the group of epigenomes labeled a_option under property to the group of epigenomes labeled b_option under property.
Example (as shown in "Part 2" of notes_v2.sh): property="sex" a_option="Female" b_option="Male"
d. ChromDiff parameters: ChromDiff allows two parameters to be set. i. test_type: The test_type variable specifies the probabilistic test ChromDiff uses to calculate p-values, as described in Yen and Kellis, 2015. Options are "wilcox", "ttest", and "ftest", for the Mann-Whitney-Wilcoxon test, the Student's t-test, and the F-test of the equality of two variances, respectively.
ii. correction: The correction variable specifies the multiple hypothesis
correction used on the raw p-values, as described in Yen and Kellis, 2015.
Options are "fdr", "bonferroni", and "BY", for the Benjamini-Hochberg FDR
correction, the Bonferroni (familywise error rate) correction, and the
Benjamini-Yekutieli FDR correction, respectively.
Example (as shown in "Part 2" of notes_v2.sh):
test_type="wilcox"
correction="fdr"
e. Optional setting of dendrogram height cutoff for clustering: heightcutoff can default to 0, which will produce no clusters, unless there are features/genes with identical epigenomic information.
Optionally, after running the analysis the first time (with default height cutoff of 0), you can then look at the resulting dendrogram file in plots/${curr_label}/${a_option}.${b_option}/perc/${test_type}/${correction}/hclust_domstate.pdf to choose a manual cutoff. Then, the height cutoff will generate and analyze gene clusters based on the given dendrogram and height cutoff, as shown in clustered plots and analysis in Yen and Kellis, 2015.
Example (as shown in "Part 2" of notes_v2.sh): heightcutoff=0
After setting up your data for input (as in part 1) and specifying your input files, variables and parameter (as in part 2), we can proceed to actually running ChromDiff on your data.
The simple way to run ChromDiff, is simply to run the script notes_v2.sh.
There, you can see that ChromDiff splits up into 4 steps:
Step 1: Setting up gene info: ./process_gene_info.sh $genefile $generegions_label
Step 2: Calculate features for all epigenomes, chromatin states, and genes: ./calculate_all_raw_features.sh $metadatafile $statecalls_label $generegions_label $states_info
-Note: You can also use ./calculate_raw_features.sh on each epid separately to
parallelize: ./calculate_raw_features.sh $epid $statecallfile $statecalls_label $generegions_label $states_info
Step 3: Calculate background information, feature names, and check data: ./featnames_bgvals.sh $statecalls_label $generegions_label $states_info
Step 4: Run ChromDiff on a particular epigenomic group comparison: ./perform_analysis.R $property $a_option $b_option $test_type $correction $curr_label $heighcutoff $metadatafile
RESULTS:
figures and plots as presented in Yen and Kellis, 2015 will output in the directory: plots/core/${label1}_${label2}/perc/${test}/${correction}/
results for comparison of expression can be found in the directory exp.pvals/
results for msigdb gene set enrichment can be found in msigdb/results/
There are 4 categories of results that ChromDiff generates, which are listed below with corresponding directories:
P-values of features (all_pvals/)
Plots and figures (plots/)
Differential expression results (exp.pvals/)
Gene set enrichment results (msigdb/)
P-values: a. Corrected p-values for every feature (gene and chromatin state) for each comparison can be found in the following path: --> all_pvals/${statecallslabel}${generegions_label}/perc.${test}.${correction}.${property}.${label1}.${label2}.txt All ${xxxxx} variables should be substituted for the values used, as specified in Part 2.
b. Corrected p-values for significant features that were used in figure generation (sampled as necessary) can be found in the following path: sig_pvals/${statecallslabel}${generegions_label}/perc.${test}.${correction}.${property}.${label1}.${label2}.txt
The format of the p-value files is two columns, with the left column as the feature id (GENEID_CHROMATINSTATENUMBER) and the right column as the p-value after multiple hypothesis correction.
a. Plot for combinations of genes and chromatin states (as shown in Fig. 2a): sig.pvals.pdf
b. Dendrogram that can be used to choose a height cutoff (as chosen in Table
S1): hclust_domstate.pdf
c. Plot of dominant chromatin state (as shown in Fig. 2c): sig_maj_plot_clust_matched_domstate.pdf
d. Plot of expression (as shown in Fig. 2d): sig_exp_plot_clust_matched_domstate.pdf
(Supplementary) e. Plot of dominant chromatin state ordered to match gene ordering from
combinations of genes and chromatin states (as shown in Fig. S2b):
sig_maj_plot_clust_matched_combinations.pdf
Differential expression: Statistics that show how many genes were differentially expressed are listed in the file exp.pvals/${statecallslabel}${generegions_label}_domstate_percent_all.txt (Tables are also available in the same directory.)
Gene set enrichment: Results for gene set enrichment from MSigDB gene sets are in the following directory: msigdb/results/c2_c5/symbols/perc.${test}.${correction}.${property}.${label1}.${label2}_all.txt
The file is in table format with the most significant gene sets listed first. As shown in the header, the columns represent: geneset genesetsize(K) numoverlaps(k) k/K q-value orig_pval
P-values are generated using the hyper-geometric test, and Storey's q-value correction is used to generate the q-values shown in the 4th column.