deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

how to plot a hic matrix starting from a .hic file? #886

Closed AlcaArctica closed 5 months ago

AlcaArctica commented 6 months ago

I would like to use the HiCExplorer command-line tools to plot Hi-C matrices automatically as part of my workflow. Using juicer pre, as described at https://github.com/c-zhou/yahs, I have generated .hic files that can be loaded into Juicebox for manual editing.

I understand that I need to convert these first to the .h5 format. So I am running:

hicConvertFormat -m file.hic -o outfile.cool --inputFormat hic --outputFormat cool --resolutions 5000

Which prints to the terminal:

INFO:hicexplorer.hicConvertFormat:Converting with hic2cool.
##########################
### hic2cool / convert ###
##########################
### Header info from hic
... Chromosomes:  ['ALL', 'assembly']
... Resolutions:  [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000]
... Normalizations:  ['VC', 'VC_SQRT', 'KR']
... Genome:  /dev/fd/63
### Converting
... Resolution 5000 took: 51.72279977798462 seconds.
### Finished! Output written to: outfile_5000.cool
... This file is single resolution and NOT higlass compatible. Run with `-r 0` for multi-resolution.

Note: I know that the ALL chromosome is a zoomed out all-by-all view that is used for Juicebox visualization. The remaining chromosomes are the actual ones used for any analysis.

followed by:

hicConvertFormat -m outfile_5000.cool -o outfile.h5 --inputFormat cool --outputFormat h5

Now, I should be ready to plot my matrix with the command:

hicPlotMatrix -m outfile.h5 -o plot.png

Which starts running with:

INFO:hicexplorer.hicPlotMatrix:Cooler or no cooler: False
INFO:hicexplorer.hicPlotMatrix:min: 1, max: 805

But after a while I get an insufficient-memory error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 18.0 GiB for an array with shape (49109, 49110) and data type float64

I have also tried to directly plot the outfile_5000.cool:

hicPlotMatrix -m outfile_5000.cool -o plot.png

Which starts with:

INFO:hicexplorer.hicPlotMatrix:Cooler or no cooler: True

But then gives a similar error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 18.0 GiB for an array with shape (49110, 49110) and data type float64

I also tried hicPlotMatrix -m outfile_5000.cool -o plot.png --region assembly:0-245540815, but I get the same problem.

What is going on? It is a very small genome (150 Mbp), I am already using the lowest resolution, and I have sufficient computing resources available. When I visualise my initial file.hic in Juicebox, I get a very nice matrix with default settings, as can be seen here:

image

I am using the following versions:

hicexplorer 3.7.3
Python 3.10.13
yahs 1.2
Juicer Tools Version 1.9.9

I have installed hicexplorer as recommended, with conda in a separate env.

BTW, when running hicInfo on my two input files I get:

hicInfo --matrices outfile.h5
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.3
File:   outfile.h5
Size:   49,109
Bin_length: 5000
Sum of matrix:  23603167.0
Chromosomes:length: assembly: 245540815 bp; 
Non-zero elements:  26,420,568
Minimum (non zero): 1
Maximum:    805
NaN bins:   1289

hicInfo --matrices outfile_5000.cool
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.3
File:   outfile_5000.cool
Date:   2024-01-16T19:07:19.238878
Genome assembly:    /dev/fd/63
Size:   49,109
Bin_length: 5000
Chromosomes:length: assembly: 245540815 bp; 
Number of chromosomes:  1
Non-zero elements:  13,233,643
The following columns are available: ['chrom' 'start' 'end' 'KR' 'VC' 'VC_SQRT']
Generated by:   hic2cool-0.8.3

joachimwolff commented 5 months ago

Hi,

With 5 kb bins you are using the highest available resolution. I recommend decreasing the resolution when plotting a larger area, and using a high resolution only for small regions of roughly 1 to 2 Mb.
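To make the trade-off concrete: the number of bins along one axis is the plotted length divided by the bin size, rounded up, and a dense float64 matrix needs 8 bytes per bin pair. The following is only a rough sketch of that arithmetic, using the chromosome length and bin size reported by hicInfo in this thread. The function names are illustrative and not part of HiCExplorer, and the real memory use of a plot may be higher than the raw array size.

```python
import math

def n_bins(length_bp: int, bin_size_bp: int) -> int:
    """Number of bins needed to cover length_bp at the given bin size."""
    return math.ceil(length_bp / bin_size_bp)

def dense_mib(length_bp: int, bin_size_bp: int) -> float:
    """Size in MiB of a dense square float64 matrix over that many bins."""
    n = n_bins(length_bp, bin_size_bp)
    return n * n * 8 / 2**20

CHROM_LEN = 245_540_815  # length of "assembly" as reported by hicInfo
BIN_SIZE = 5000          # Bin_length as reported by hicInfo

for label, length in [("whole chromosome", CHROM_LEN), ("2 Mb region", 2_000_000)]:
    bins = n_bins(length, BIN_SIZE)
    print(f"{label}: {bins:,} bins, {dense_mib(length, BIN_SIZE):,.1f} MiB dense")
```

At 5 kb the whole chromosome comes to 49,109 bins, matching the Size reported by hicInfo, whereas a 2 Mb window is only 400 bins and a little over 1 MiB as a dense array. A region that spans the entire chromosome, such as assembly:0-245540815, does not reduce the matrix at all.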

What happens here is that your computer does not have enough memory. HiCExplorer needs 18 GiB to plot your selected data range, and your computer has less available.
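The 18 GiB figure can be checked from the error message itself: the traceback shows a dense float64 array being allocated, and such an array stores 8 bytes per cell no matter how many cells are zero. A minimal sketch of that check, using the shape from the traceback above (the helper name dense_gib is made up for illustration):

```python
def dense_gib(rows: int, cols: int, bytes_per_cell: int = 8) -> float:
    """Memory for a dense rows x cols array in GiB (1 GiB = 2**30 bytes)."""
    return rows * cols * bytes_per_cell / 2**30

# Shape taken from the numpy error message: (49109, 49110), dtype float64
print(f"{dense_gib(49109, 49110):.1f} GiB")  # prints "18.0 GiB"
```

This is why the .cool and .h5 files can be small on disk, since they store only the non-zero contacts, while plotting the full matrix at 5 kb still asks for about 18 GiB.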

Hope that helps,

Joachim

AlcaArctica commented 5 months ago

@joachimwolff Yes, I realised this later as well. Somehow I thought that the smaller the number, the lower the resolution, i.e. the easier it is to compute (as with image resolution). My bad. Thanks for commenting. (And it's working fine now with a larger resolution number :)

joachimwolff commented 5 months ago

The smaller the number, the smaller the bin size, and therefore the higher the resolution. It is a bit confusing sometimes.
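One way to keep the convention straight is to tabulate the bin sizes listed in the .hic header above against the matrix each one produces. The sketch below picks the smallest bin size (that is, the highest resolution) whose dense whole-chromosome float64 matrix fits a memory budget. The 4 GiB budget and the function name are arbitrary assumptions for illustration, and actual plotting may need more memory than the raw array.

```python
import math

# Bin sizes (bp) listed in the .hic header printed by hic2cool
RESOLUTIONS = [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000]
CHROM_LEN = 245_540_815  # length of "assembly" as reported by hicInfo

def finest_bin_size(length_bp, bin_sizes, budget_gib):
    """Return the smallest bin size whose dense float64 matrix fits the budget.

    Returns None if even the largest bin size does not fit.
    """
    for size in sorted(bin_sizes):  # smallest bin size = highest resolution first
        n = math.ceil(length_bp / size)
        if n * n * 8 / 2**30 <= budget_gib:
            return size
    return None

# Hypothetical 4 GiB budget for the matrix alone
print(finest_bin_size(CHROM_LEN, RESOLUTIONS, budget_gib=4.0))  # prints 25000
```

Under that assumed budget, 5 kb and 10 kb bins are both too fine for a whole-chromosome plot, and 25 kb is the first bin size that fits.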