The aim of this analysis was to identify distinct distributions of interacting contacts within genomic regions. A pair of thresholds is applied to the Z-scores: a minimum threshold to identify sectors that are enriched and a maximum threshold to identify sectors that are not enriched.
Please have 'misha' R package installed prior to running run_domainClassifyR.R, with a GENOME_DB configured (minimum example provided in example_data).
Install domainClassifyR
using devtools::github('ChristopherBarrington/domainClassifyR')
. Tested under Linux R 3.5.1 with:
package * version date
devtools * 1.13.6 2018-06-27
domainClassifyR * 0.0.0.9000 2018-11-02
doMC 1.3.5 2017-12-12
dplyr * 0.7.7 2018-10-16
ggplot2 * 3.1.0 2018-10-25
magrittr 1.5 2014-11-22
misha * 4.0.4 2018-09-24
plyr * 1.8.4 2016-06-08
purrr 0.2.5 2018-05-29
scales 1.0.0 2018-08-09
snow * 0.4-2 2016-10-14
tibble 1.4.2 2018-01-22
tidyselect 0.2.5 2018-10-11
get_contacts()
is used to query the track_nm
score track to select all interacting fends within each 2D interval in domains
. The data are saved into a list and cached in cache.path
to quickly reload the same 2D interval.
domains
A data.frame of 2D intervalstrack_nm
Name of the score track from which fend pairs are extractedcache.path
Absolute path to the directory that cached domains can be found or saved (will be created)All subsequent functions use the list of contacts returned by get_contacts()
as input, specified in each by the contacts
argument.
See ?get_contacts
Once the 2D intervals are cached, they are loaded into a list (one element per 2D interval). Subsequent characterisations of 2D intervals are required to format the data for sampling.
add.domain.features()
appends information to the DOMAIN
element and to the all contacts data.frame: the sizes of the 2 intervals are recorded as well as the position in resolution
bins of the scored fend pair relative to the start and end of the 2D interval. Interactors (high-score fend pairs) with a bin position of 0 would be considered as leading- or trailing-edge).
resolution
The size in bp of the trailing- or leading- edge (and therefore the size of the interval 'corner')get.high.score.fends()
duplicates the ALL
contacts data.frame, selecting only those fend pairs with score above min_score
into the HIGH_SCORE
element.
min_score
The minimum score for which fend pairs are considered interactorscount.contacts()
is used to count the number of contact pairs in the 2D interval (and takes no arguments beyond the contacts
list).
get.domain.position.classification()
calculates the number of contacts (and interactors) in each sector of each 2D interval using the binned positions of contacts from the add.domain.features
function and the resolution
parameter. The function also calculates the number of interactors in a sector as a proportion of all contacts in the 2D interval ($FEND.SCORE.DISTRIBUTION$SCORES_dHS_mALL
) and of the total number of interactors ($FEND.SCORE.DISTRIBUTION$SCORES_dALL_mHS
). This function takes no arguments beyond the contacts
list.
See ?add.domain.features
, ?get.high.score.fends
, ?count.contacts
, ?get.domain.position.classification
The contacts within each 2D interval are then randomly sampled to generate a random distribution of contacts within each sector that can be compared to the observed number of interactors in each sector. get.fend.samples()
is used to randomly select contacts (ignoring score) and count how many of the randomly selected contacts are positioned in each of the four sectors. The number of contacts sampled (using R
sample()
) is determined by the number of interactors. The distribution is generated by repeating the sampling process n
times; a data.frame of n
rows is produced with each column showing the number of randomly selected contacts in a sector ($CONTACTS$RANDOM_SAMPLE_TABLES
). The process can be parallelised using the doMC
package and the cluster_size
parameter to determine the number of parallel threads. For reproducibility a seed can be set, which is determined by the 2D interval ID - expecting an ID of 12:34
the seed would be 1234
.
random_seed
Boolean whether to use the intervalID to set a seedn
number of times to repeat the sampling processcluster_size
number of parallel threads to run or 1
for no parallelisationSee ?get.fend.samples
Using the number of sampled contacts per sector, get_Z_statistics()
calculates the mean and standard deviation of the sector distributions which are used with the number of observed interactors to calculate a Z-score for each sector (of each 2D interval). This is saved into $Z_STATISTICS
. Z-scores are converted to p-values using pnorm
(P
). get_min_fend_pair_domains()
is used to filter the 2D interval set, requiring that an interval pair has at least min
interactors. Multiple testing is then corrected for using get_Q_values()
. The 2D intervals set is collapsed to a data such that each row is a filtered 2D interval and each column is P
for each sector in the 2D interval. Each column is corrected using the method specified by method
which is passed to p.adjust()
.
The enrichment of each sector in 2D regions is calculated independently so each region may have multiple enriched sectors. The classification of the 2D region is the aggregate of the enriched domain sectors. Regions that have a sector classified as ambiguous are excluded from further analysis. Increasing the lower threshold or decreasing the upper threshold would decrease the number of ambiguous sectors identified but could decrease specificity (or homogeneity) of the classified groups. Therefore, a TAD presented with a ‘Corner’ classification (loop domain) shows specific (and exclusive) enrichment of high-scoring contacts in the 60kb interacting region between the TAD borders. TADs with multiple enriched sectors were not considered; in these TADs, the Leading+Corner or Trailing+Corner sectors were most common.
min
Minimum number of contacts or interactors required in a domainmeasure
Should be 'HIGH_SCORE'
or 'ALL'
to determine which contact set is filtered bymethod
The correction method to apply, see ?p.adjust
See ?get_Z_statistics
, ?get_min_fend_pair_domains
, ?get_Q_values
Using example data, script should complete in minutes.
library(misha)
library(domainClassifyR)
# set the misha GENOME_DB root (eg extract the example_data mm10 archive into /absolute/path/to/GENOME_DB)
gsetroot('/absolute/path/to/GENOME_DB/mm10/')
# load data / set parameters
intervals_2d <- gintervals.load("intervs.chr19_domains")
track_nm <- 'hic.example_project.example_dataset.score'
cache_path <- file.path(getwd(), 'domainclassifyr_cache')
# run the parts of the package
contacts <- get_contacts(domains=intervals_2d,
track_nm=track_nm,
cache.path=cache_path,
band=c(10e3,15e6),
get.fends=FALSE)
contacts <- add.domain.features(contacts, resolution=60e3)
contacts <- get.high.score.fends(contacts, min_score=40)
contacts <- count.contacts(contacts)
contacts <- get.domain.position.classifications(contacts)
contacts <- get.fend.samples(contacts, n=1000)
contacts <- get_Z_statistics(contacts)
contacts <- get_min_fend_pair_domains(contacts, min=100)
contacts <- get_Q_values(contacts)
# get a data.frame of qvalues for each sector of the domain
> ldply(contacts, function(d) d$Z_STATISTICS$OBSERVED$Q) %>% tibble::as.tibble()
# # A tibble: 77 x 5
# .id CORNER FORWARD REVERSE OTHER
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2012:2012 1.00e+ 0 1.48e- 4 0. 1.00e+ 0
# 2 2013:2013 1.00e+ 0 1.00e+ 0 1.00e+ 0 3.30e-21
# 3 2015:2015 1.00e+ 0 4.75e-65 1.00e+ 0 1.00e+ 0
# 4 2017:2017 1.32e-52 1.00e+ 0 8.29e- 2 8.92e- 1
# 5 2019:2019 1.27e- 8 1.00e+ 0 9.79e-209 1.00e+ 0
# 6 2021:2021 4.54e-14 3.54e- 3 1.00e+ 0 1.00e+ 0
# 7 2022:2022 1.00e+ 0 1.00e+ 0 1.00e+ 0 6.35e-60
# 8 2023:2023 1.00e+ 0 1.00e+ 0 1.00e+ 0 1.33e-11
# 9 2024:2024 4.34e-62 1.00e+ 0 1.15e-186 1.00e+ 0
# 10 2026:2026 5.88e-55 2.54e-25 1.00e+ 0 1.00e+ 0
# # ... with 67 more rows
# plot the contacts in one domain
ggplot2::ggplot(contacts[['2106:2106']]$CONTACTS$HIGH_SCORE)+
aes(x=FEND.X,y=FEND.Y,colour=SCORE)+
geom_point(shape=20)+coord_fixed()+theme(aspect.ratio=1)
This should produce not produce any output except for progress bars (unless there is an error).
> contacts <- get_contacts(domains=intervals_2d,
+ track_nm=track_nm,
+ cache.path=cache_path,
+ band=c(10e3,15e6),
+ get.fends=FALSE)
[get.contacts] Getting contact matrices for 85 domains
[get.contacts] Saving new domains to cache [n=85]
|======================================================================| 100%
[get.contacts] Preallocating a 85 element list
[get.contacts] Loading matrices from cache (/home/local/domainclassifyr_cache)
|======================================================================| 100%
> contacts <- add.domain.features(contacts, resolution=60e3)
[add.domain.features] Adding domain features
|======================================================================| 100%
> contacts <- get.high.score.fends(contacts, min_score=40)
[get.high.score.fends] Identifying high-scoring fends [>=40]
|======================================================================| 100%
> contacts <- count.contacts(contacts)
[count.contacts] Counting contacts
|======================================================================| 100%
> contacts <- get.domain.position.classifications(contacts)
Getting distribution of fend positions
|======================================================================| 100%
> contacts <- get.fend.samples(contacts, n=1000)
[get.fend.samples] Randomly selecting contacts from ALL [x 1,000]
> contacts <- get_Z_statistics(contacts)
[get_Z_statistics] Calculating Z-statistics and p-values
|======================================================================| 100%
> contacts <- get_min_fend_pair_domains(contacts, min=100)
[get_min_fend_pair_domains] Selecting domains with at least 100 fend pairs in HIGH_SCORE
> contacts <- get_Q_values(contacts)
[get_Q_values] Correcting for multiple testing
>
Shown is the structure for the first TAD in the contacts
list of 2D intervals.
> str(contacts[[1]])
List of 6
$ DOMAIN :'data.frame': 1 obs. of 13 variables:
..$ chrom1 : Factor w/ 1 level "chr19": 1
..$ start1 : num 3172500
..$ end1 : num 4974499
..$ chrom2 : Factor w/ 1 level "chr19": 1
..$ start2 : num 3172500
..$ end2 : num 4974499
..$ intervalID : Factor w/ 1 level "2012:2012": 1
..$ intervalID1: num 2012
..$ SIZE1 : num 1801999
..$ intervalID2: num 2012
..$ SIZE2 : num 1801999
..$ L1 : Factor w/ 1 level "SELF": 1
..$ AREA : num 3.25e+12
$ CONTACTS :List of 3
..$ ALL :'data.frame': 88394 obs. of 7 variables:
.. ..$ FEND.X : num [1:88394] 3669732 3671461 3665714 3671461 3663092 ...
.. ..$ FEND.Y : num [1:88394] 4965109 4965109 4973544 4973544 4965622 ...
.. ..$ SCORE : num [1:88394] -13.2 -13.7 -16.8 -12.7 -14.3 ...
.. ..$ GROUP.X : num [1:88394] 8 8 8 8 8 8 8 8 8 8 ...
.. ..$ GROUP.Y : num [1:88394] 0 0 0 0 0 0 0 0 0 0 ...
.. ..$ FEND.DISTANCE: num [1:88394] 1295377 1293648 1307830 1302083 1302530 ...
.. ..$ POSITION : Factor w/ 4 levels "CORNER","FORWARD",..: 3 3 3 3 3 3 3 3 3 3 ...
..$ HIGH_SCORE :'data.frame': 3113 obs. of 7 variables:
.. ..$ FEND.X : num [1:3113] 3796255 3734437 3734436 3734436 3734437 ...
.. ..$ FEND.Y : num [1:3113] 3980935 3919813 3919812 3904170 3904170 ...
.. ..$ SCORE : num [1:3113] 40.1 40.1 40.1 40.6 40.6 ...
.. ..$ GROUP.X : num [1:3113] 10 9 9 9 9 9 9 9 9 9 ...
.. ..$ GROUP.Y : num [1:3113] 16 17 17 17 17 17 17 17 17 17 ...
.. ..$ FEND.DISTANCE: num [1:3113] 184680 185376 185376 169734 169733 ...
.. ..$ POSITION : Factor w/ 4 levels "CORNER","FORWARD",..: 4 4 4 4 4 4 4 4 4 4 ...
..$ RANDOM_SAMPLE_TABLES:List of 1
.. ..$ OBSERVED: int [1:1000, 1:4] 0 0 0 0 0 1 0 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
$ CACHE.FILE : chr "/home/local/domainclassifyr_cache/1c7f7c30abbde91764ed70e8ddc14d62.RData"
$ N_CONTACTS :List of 2
..$ ALL : int 88394
..$ HIGH_SCORE: int 3113
$ FEND.SCORE.DISTRIBUTION:List of 7
..$ ALL : 'table' num [1:4(1d)] 3 645 4697 83049
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ HIGH_SCORE : 'table' num [1:4(1d)] 0 41 1196 1876
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ SCORES_dALL_mHS: 'table' num [1:4(1d)] 0 197.9 792.7 70.3
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ SCORES_dHS_mALL: 'table' num [1:4(1d)] 0 8.5 1804.6 50048.2
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ SCORES_dALL : 'table' num [1:4(1d)] 0 0.0636 0.2546 0.0226
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ SCORES_dHS : 'table' num [1:4(1d)] 0 0.0132 0.3842 0.6026
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
..$ SCORES : 'table' num [1:4(1d)] 0 0.0132 0.3842 0.6026
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
$ Z_STATISTICS :List of 1
..$ OBSERVED:List of 7
.. ..$ MEAN: Named num [1:4] 0.105 22.711 165.189 2924.995
.. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ SDEV: Named num [1:4] 0.313 4.776 12.401 13.192
.. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ OBS : 'table' num [1:4(1d)] 0 41 1196 1876
.. .. ..- attr(*, "dimnames")=List of 1
.. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ Z : 'table' num [1:4(1d)] -0.335 3.829 83.125 -79.52
.. .. ..- attr(*, "dimnames")=List of 1
.. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ P : 'table' num [1:4(1d)] 6.31e-01 6.43e-05 0.00 1.00
.. .. ..- attr(*, "dimnames")=List of 1
.. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ p : 'table' num [1:4(1d)] 6.31e-01 6.43e-05 0.00 1.00
.. .. ..- attr(*, "dimnames")=List of 1
.. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
.. ..$ Q : Named num [1:4] 1 0.000148 0 1
.. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
>```