Consolidating cooler reading code

Phlya commented 6 years ago

Hi, so I have been trying to make it work directly from cooler files, and quickly... Merging bins and pixels is a pain in the neck, takes extremely long at any reasonably high resolution. So I am now trying to split both tables by chromosomes first, simultaneously removing inter-chromosomal interactions, and then merging much smaller tables. I got it all to work, but now I have a problem with the resulting hic.list - for some reason it's missing end coordinates, and D:

> head(hic.list[[1]])
   chr1   start1 end1 chr2   start2 end2   IF1   IF2  D           M
1: chr1  4608000   NA chr1  4608000   NA 51427 77939 NA  0.59981942
2: chr1  3584000   NA chr1  4608000   NA  4187  4441 NA  0.08496765
3: chr1  4096000   NA chr1  4608000   NA 18159 18489 NA  0.02598244
4: chr1  3072000   NA chr1  4608000   NA  2744  2830 NA  0.04452157
5: chr1  2560000   NA chr1  4608000   NA   292   345 NA  0.24062799
6: chr1 39936000   NA chr1 50688000   NA   186   142 NA -0.38941169

I am going to be away on the weekend and can't work on this further, but if you could have a look and perhaps let me know what is wrong, I would greatly appreciate it. I doubt this is the most efficient code, I am especially suspicious of very weird mapply output I get which causes me to have to use a loop to get all the data into a normal list... As I said, I normally don't use R, so I am quite proud of this already.

library(HiCcompare)
library(BiocParallel)
library(hdf5r)
library(dplyr)

args = commandArgs(trailingOnly=TRUE)
register(MulticoreParam(workers = 2), default = TRUE)

clr1 <- H5File$new('~/eddie/ilia/coolers/serum-KM.1000.multires.cool', mode='r+')#H5File$new(args[1], mode='r+')

chr <- as.character(clr1[['5/bins/chrom']][])
start <- as.numeric(clr1[['5/bins/start']][])
end <- as.numeric(clr1[['5/bins/end']][])
bins <- data.frame(cbind(chr, start, end))

bin1_id <- as.numeric(clr1[['5/pixels/bin1_id']][])+1
bin2_id <- as.numeric(clr1[['5/pixels/bin2_id']][])+1
IF <- clr1[['5/pixels/count']][]
pxls1 <- data.frame(cbind(bin1_id, bin2_id, IF))

clr1$close_all()

clr2 <- H5File$new('~/eddie/ilia/coolers/2i.1000.multires.cool', mode='r+')#H5File$new(args[2], mode='r+')
bin1_id <- as.numeric(clr2[['5/pixels/bin1_id']][])+1
bin2_id <- as.numeric(clr2[['5/pixels/bin2_id']][])+1
IF <- as.numeric(clr2[['5/pixels/count']][])
pxls2 <- data.frame(cbind(bin1_id, bin2_id, IF))

clr2$close_all()

chrbins <- split(as.numeric(row.names(bins)), bins[,'chr'])
isin <- function(pxls, chrbinvector) {
  start <- chrbinvector[1]
  end <- chrbinvector[length(chrbinvector)]
 data.frame(pxls[between(pxls[,'bin1_id'], start, end) & between(pxls[,'bin2_id'], start, end),])
}

dat1 <- bplapply(chrbins, FUN=isin, pxls=pxls1)
dat2 <- bplapply(chrbins, FUN=isin, pxls=pxls2)

binchrs <- split(bins, bins[,'chr'])

mymerge1 <- function(bins, pxls) {
  data.frame(merge(bins, pxls, by.x='row.names', by.y='bin1_id'))
  }
mymerge2 <- function(bins, pxls) {
  data.frame(merge(bins, pxls, by.x='row.names', by.y='bin2_id', suffixes=c('2', '1')))
  }

### dat1

dat1merged <- mapply(mymerge1, binchrs, dat1)
dat1 <- list()
for(i in 1:(dim(dat1merged)[2])){
  dat1[[i]]<-as.data.frame(dat1merged[,i])}

dat1merged <- mapply(mymerge2, binchrs, dat1)
dat1 <- list()
for(i in 1:(dim(dat1merged)[2])){
  dat1[[i]]<-as.data.frame(dat1merged[,i])[,c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")]}

dat1 <- lapply(dat1, FUN = function(df){df[order(df$chr1, df$start1, df$start2),]})

### dat2  

dat2merged <- mapply(mymerge1, binchrs, dat2)
dat2 <- list()
for(i in 1:(dim(dat2merged)[2])){
  dat2[[i]]<-as.data.frame(dat2merged[,i])}

dat2merged <- mapply(mymerge2, binchrs, dat2)
dat2 <- list()
for(i in 1:(dim(dat2merged)[2])){
  dat2[[i]]<-as.data.frame(dat2merged[,i])[,c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")]}

dat2 <- lapply(dat2, FUN = function(df){df[order(df$chr1, df$start1, df$start2),]})

hic.list <- mapply(create.hic.table, dat1, dat2, SIMPLIFY = FALSE, scale=FALSE)
hic.list <- total_sum(hic.list)
hic.list <- hic_loess(hic.list, Plot=FALSE, Plot.smooth=FALSE, parallel=TRUE)
hic.list <- hic_compare(hic.list, A.min = NA, adjust.dist = TRUE, p.method = 'fdr', Plot = TRUE, parallel=TRUE)
hic.list <- do.call(rbind, hic.list)
hic.list <- hic.list[hic.list$p.adj<0.05,]
write.table(hic.list, 'test_output.txt')

Thank you!

jstansfield0 commented 6 years ago

That is strange. Is the data you are using available on any of the public repositories so that I can download it and try to reproduce this?

mdozmorov commented 6 years ago

@jstansfield0, if you can test it on any of our datasets that will be great. I suspect something is happening in the merging functions. We need to incorporate this code in the vignette, so using the data in the package will be best.

Phlya commented 6 years ago

Sorry, this is not publicly available data, but I don't expect this to be any different with any other pair of cooler files.

I checked how the dat1 and dat2 look prior to merging together, and they are fine... so this is very strange.

On Fri, 14 Sep 2018, 17:33 Mikhail Dozmorov, notifications@github.com wrote:

@jstansfield0 https://github.com/jstansfield0, if you can test it on any of our datasets that will be great. I suspect something is happening in the merging functions. We need to incorporate this code in the vignette, so using the data in the package will be best.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dozmorovlab/HiCcompare/issues/9#issuecomment-421414259, or mute the thread https://github.com/notifications/unsubscribe-auth/ACwsujN7pVeMIvhm85VigmX13hk4FKeIks5ua9pxgaJpZM4WphGL .

Phlya commented 6 years ago

OK, I solved this problem, cbind was creating a matrix, and so casting everything to a wrong type...

Phlya commented 6 years ago

Seems like dplyr's join is orders of magnitude faster than base merge. Some minor changes, this is my current code (except the parallel processing of any kind currently crashes my R on the cluster???)

library(BiocParallel)
library(hdf5r)
library(dplyr)
library(tibble)
library(purrr)
library(HiCcompare)

args = commandArgs(trailingOnly=TRUE)
register(MulticoreParam(workers = 3), default = TRUE)

ptm1 <- proc.time()

clr1 <- H5File$new('../coolers/serum-KM.1000.multires.cool', mode='r+')#H5File$new(args[1], mode='r+')#

chr <- as.character(clr1[['9/bins/chrom']][])
start <- as.numeric(clr1[['9/bins/start']][])
end <- as.numeric(clr1[['9/bins/end']][])
bins <- data.frame(chr, start, end)

ptm2 <- proc.time()
print(ptm2-ptm1)
print('Created bins')

bin1_id <- as.numeric(clr1[['9/pixels/bin1_id']][])+1
bin2_id <- as.numeric(clr1[['9/pixels/bin2_id']][])+1
IF <- clr1[['9/pixels/count']][]
pxls1 <- data.frame(bin1_id, bin2_id, IF)

ptm3 <- proc.time()
print(ptm3-ptm2)
print('Got first pixels')

clr1$close_all()

clr2 <- H5File$new('../coolers/2i.1000.multires.cool', mode='r+')#H5File$new(args[2], mode='r+')#
bin1_id <- as.numeric(clr2[['9/pixels/bin1_id']][])+1
bin2_id <- as.numeric(clr2[['9/pixels/bin2_id']][])+1
IF <- as.numeric(clr2[['9/pixels/count']][])
pxls2 <- data.frame(bin1_id, bin2_id, IF)

ptm4 <- proc.time()
print(ptm4-ptm3)
print('Got second pixels')

clr2$close_all()

chrbins <- split(as.numeric(row.names(bins)), bins[,'chr'])
isin <- function(pxls, chrbinvector) {
  start <- chrbinvector[1]
  end <- chrbinvector[length(chrbinvector)]
  lo = which.max(pxls[,'bin1_id']>=start)
  hi = which.max(pxls[,'bin1_id']>end)-1
  if (hi==0) {hi=dim(pxls)[[1]]}
  pxls[seq.int(lo, hi, 1),]
}

binchrs <- split(bins, bins[,'chr'])

mymerge1 <- function(bins, pxls) {
  bins <- rownames_to_column(bins)
  bins$rowname <- as.numeric(bins$rowname)
  data.frame(inner_join(bins, pxls, by=c('rowname'='bin1_id')))
  }
mymerge2 <- function(bins, pxls) {
  bins <- rownames_to_column(bins)
  bins$rowname <- as.numeric(bins$rowname)
  df <- data.frame(inner_join(bins, pxls, by=c('rowname'='bin2_id'), suffix=c('2', '1')))
  df[df$chr1==df$chr2,]
  }

### dat1
dat1 <- bplapply(chrbins, FUN=isin, pxls=pxls1)
dat1merged <- mapply(mymerge1, binchrs, dat1)
dat1 <- list()
for(i in 1:(dim(dat1merged)[2])){
  dat1[[i]]<-as.data.frame(dat1merged[,i])}

dat1merged <- mapply(mymerge2, binchrs, dat1)
dat1 <- list()
for(i in 1:(dim(dat1merged)[2])){
  dat1[[i]]<-as.data.frame(dat1merged[,i])[,c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")]}

ptm5 <- proc.time()
print(ptm5-ptm4)
print('Merged first pixels')
### dat2  

dat2 <- bplapply(chrbins, FUN=isin, pxls=pxls2)

dat2merged <- mapply(mymerge1, binchrs, dat2)
dat2 <- list()
for(i in 1:(dim(dat2merged)[2])){
  dat2[[i]]<-as.data.frame(dat2merged[,i])}

dat2merged <- mapply(mymerge2, binchrs, dat2)
dat2 <- list()
for(i in 1:(dim(dat2merged)[2])){
  dat2[[i]]<-as.data.frame(dat2merged[,i])[,c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")]}

ptm6 <- proc.time()
print(ptm6-ptm5)
print('Merged second pixels')
print('Data ready for HiCcompare')

hic.list <- mapply(create.hic.table, dat1, dat2, SIMPLIFY = FALSE, scale=FALSE)

ptm7 <- proc.time()
print(ptm7-ptm6)
print('Created Hi-C tables')

hic.list <- total_sum(hic.list)

ptm8 <- proc.time()
print(ptm8-ptm7)
print('Total sum normalized Hi-C tables')

hic.list <- hic_loess(hic.list, Plot=FALSE, Plot.smooth=FALSE, parallel=TRUE)

ptm9 <- proc.time()
print(ptm9-ptm8)
print('Loess normalized Hi-C tables')

hic.list <- HiCcompare::hic_compare(hic.list, A.min = NA, adjust.dist = TRUE, p.method = 'fdr', Plot = TRUE, parallel=TRUE)

ptm10 <- proc.time()
print(ptm10-ptm9)
print('Done with comparing data')

hic.list <- do.call(rbind, hic.list)

hic.list <- hic.list[hic.list$p.adj<0.05,]

write.table(hic.list, 'test_output.txt')

ptm11 <- proc.time()
print(ptm11-ptm10)
print('All saved')

mdozmorov commented 6 years ago

Thanks, @Phlya, for pasting the code. @jstansfield0 is adapting your code for the vignette and tests it on our cluster. We will look into why parallelization may fail on the cluster.

jstansfield0 commented 6 years ago

When you submit your script to the cluster are you specifying the number of processors to be used? It is possible you are registering more processors than are actually assigned to the job.

Phlya commented 6 years ago

Yes, I specify 4 cores.

I am getting a feeling it just requires a really huge amount of memory for parallel processing, I just submitted it again with 64Gb/per core and I believe merging with a parallel call has worked.

jstansfield0 commented 6 years ago

That could be it. When I run jobs on higher resolution data on our cluster I specify 80 - 100GB of memory for the job and use 16 - 20 processors.

Phlya commented 6 years ago

Is this memory per core you mean, or in total?

jstansfield0 commented 6 years ago

Total memory.

Phlya commented 6 years ago

OK, then I asked for comparable total memory and way more memory per core, and had problems... Anyway, I'll see if this solves the problem.

jstansfield0 commented 6 years ago

The parallelization splits the data up by chromosome. It is just a bplapply() over the list of chromosomes so you should only need as much memory per core as the size of the largest chromosome map. If parallel still won't work I'd just try running it without the parallel option overnight.

jstansfield0 commented 6 years ago

The other option for speeding things up is if you set a value for span when running hic_loess. Something like span = 0.1 should be fine. The automatic calculation for span is what takes the longest.

Phlya commented 6 years ago

Yep, I left it running consecutively when left the lab. Will try setting the span, thanks! Is there a way to guess it for my particular data?

Phlya commented 6 years ago

Ah, I found span in the output of loess. It's set to something more like 0.013 everywhere, not ~0.1, does it mean anything?..

jstansfield0 commented 6 years ago

It shouldn't matter much between 0.013 and 0.1. The span is just how much of the data in the local area is used for fitting the loess curve.

Phlya commented 6 years ago

Maybe should be separate issues, happy to create them, let me know, but:

Plotting doesn't work in parallel mode, the output file is corrupted (also, since there is no way to specify the filename for the plots, multiple running jobs will try to write to the same file)
All my MD plots look really weird! Any thoughts?..

jstansfield0 commented 6 years ago

You can try saving plots to a pdf like this:

pdf(file = 'mdplots.pdf')
hic_compare(hic.list)
dev.off()

Also your MD plot looks pretty typical to me. What resolution is the data you are using?

Phlya commented 6 years ago

OK, thanks, will try that!

Are they are supposed be so stripy?.. They don't look like this in the vigniette. This is from 32 kb resolution.

jstansfield0 commented 6 years ago

They get stripey at higher resolutions. This is just because the range of differences between the datsets tends to be lower, i.e. you get a bunch of the differences being equal to the same value which shows up as the lines. The data in the vignette is 1MB resolution so the range of possible values is larger.

Phlya commented 6 years ago

Aha I see, this does make sense, thanks!

jstansfield0 commented 5 years ago

Added the cooler2bedpe function for reading in .cool files.

dozmorovlab / HiCcompare

Consolidating cooler reading code #9