Working with cooler files

Phlya commented 6 years ago

Hello,

I am trying to use HiCCompare with cooler files. Is it right that you are suggesting to dump the pixels from the cooler (basically, a 7-column bedpe file), but then read all of them into memory to convert into an R sparse matrix? The problem is that this is very slow for high resolutions (e.g. 10 kb), and also takes up a lot of memory... Is it impossible to just read the data directly from the cooler file, which is just an hdf5 container, for each chromosome while running normalization and calling of differential interactions, so that the whole genome is never stored in memory?

Thanks, Ilya

jstansfield0 commented 6 years ago

Hi Ilya,

Yes, you are correct that you need to read the matrices into memory. We have not implemented any way to read the data directly from cooler files without storing it in memory. I would suggest using a computing cluster for the analysis you wish to perform if you have one available. Or you can perform your analysis one chromosome at a time and thus only need to read into memory the data for one chromosome.

Best, John

mdozmorov commented 6 years ago

It doesn't mean we can't. Let's keep this issue open and look into implementing direct hdf5 handling. Thanks, @Phlya, for the suggestion.

Phlya commented 6 years ago

I see, thanks!

Phlya commented 6 years ago

With rhdf5 it was easier to load the whole table than hdf5r - works in one line for a whole bins/pixels dataframe instead of a line per column and then cbinding - but I had weird problems with it on the cluster... But here it goes, loading data directly from cooler files without dumping first. Here it's hardcoded for a multiresolution cooler and zoom level 10, but easy to adapt or make more flexible, especially you know R better than me...

library(HiCcompare)
library(BiocParallel)
library(hdf5r)

args = commandArgs(trailingOnly=TRUE)

clr1 <- H5File$new(args[1], mode='r+')

chr <- as.character(clr1[['10/bins/chrom']][])
start <- clr1[['10/bins/start']][]
end <- clr1[['10/bins/end']][]
bins <- cbind(chr, start, end)

bin1_id <- clr1[['10/pixels/bin1_id']][]
bin2_id <- clr1[['10/pixels/bin2_id']][]
count <- clr1[['10/pixels/count']][]
pxls1 <- cbind(bin1_id, bin2_id, count)

clr1$close_all()

clr2 <- H5File$new(args[2], mode='r+')
bin1_id <- clr2[['10/pixels/bin1_id']][]
bin2_id <- clr2[['10/pixels/bin2_id']][]
count <- clr2[['10/pixels/count']][]
pxls2 <- cbind(bin1_id, bin2_id, count)

clr2$close_all()

dat1 <- merge(bins, pxls1, by.x=0, by.y='bin1_id')
dat1 <- merge(dat1, bins, by.y=0, by.x='bin2_id', suffixes=c('1', '2'))
dat1 <- dat1[c("chrom1", "start1", "end1", "chrom2", "start2", "end2", "count"), ]
colnames(dat1) <- c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")

dat2 <- merge(bins, pxls1, by.x=0, by.y='bin1_id')
dat2 <- merge(dat2, bins, by.y=0, by.x='bin2_id', suffixes=c('1', '2'))
dat2 <- dat2[c("chrom1", "start1", "end1", "chrom2", "start2", "end2", "count"), ]
colnames(dat2) <- c("chr1", "start1", "end1", "chr2", "start2", "end2", "IF")

dat1 <- dat1[dat1$chr1==dat1$chr2, ]
dat2 <- dat2[dat2$chr1==dat2$chr2, ]

dat1 <- split(dat1, dat1$chr1)
dat2 <- split(dat2, dat2$chr1)

hic.list <- mapply(create.hic.table, dat1, dat2, SIMPLIFY = FALSE, scale=FALSE)

hic.list <- total_sum(hic.list)

register(MulticoreParam(workers = 7), default = TRUE)

hic.list <- hic_loess(hic.list, Plot=FALSE, Plot.smooth=FALSE, parallel=TRUE)
hic.list <- hic_compare(hic.list, A.min = NA, adjust.dist = TRUE, p.method = 'fdr', Plot = TRUE, parallel=TRUE)

hic.list <- do.call(rbind, hic.list)

hic.list <- hic.list[hic.list$p.adj<0.05,]

write.table(hic.list, args[3])

Still waiting for the script to run, but it clearly hasn't broken during data loading since it's been running a while! :)

mdozmorov commented 6 years ago

That is fantastic, thanks, @Phlya! We will incorporate you code in the documentation.

Phlya commented 6 years ago

TBH this is still very slow, pretty sure merging pixels with bins is the bottleneck...

dozmorovlab / HiCcompare

Working with cooler files #7