deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Use npz file I generated #50

Closed apaytuvi closed 7 years ago

apaytuvi commented 7 years ago

I have a matrix that I generated myself and saved as .npz. I've realized that, besides matrix, the file also needs chrNameList, startList, endList, and extraList. What should they contain?

Thank you.

fidelram commented 7 years ago

Both the .npz and the .h5 formats store the matrix plus the bins (chrom, start, end). Each index in chrNameList, startList, endList, and extraList corresponds to the bin index in the matrix.

In extraList I usually put the bin read coverage. extraList is only used under certain circumstances by hicCorrect to discard bins containing repetitive regions. It is safe to replace it by a vector of 0s or any other numeric value.

The .npz and .h5 formats also store the correction factors used for the Hi-C iterative correction, together with a vector containing the indices of all bins in the matrix that were filtered out during the correction.
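As a rough illustration of the layout described above, the bin lists can be built from chromosome sizes and a bin size. This is only a sketch: the chromosome names and sizes are made up, and extraList is filled with zeros as a placeholder rather than real read coverage.

```python
import numpy as np

# Hypothetical chromosome sizes in bp; real sizes come from your genome assembly.
chrom_sizes = [("chrA", 400000), ("chrB", 250000)]
bin_size = 150000

chromosomes, starts, ends = [], [], []
for name, size in chrom_sizes:
    for start in range(0, size, bin_size):
        chromosomes.append(name)
        starts.append(start)
        # The last bin of each chromosome is truncated at the chromosome end.
        ends.append(min(start + bin_size, size))

chromosomes = np.array(chromosomes)
starts = np.array(starts)
ends = np.array(ends)
# Placeholder for extraList: a vector of zeros, one value per bin.
extra = np.zeros(len(starts))

n_bins = len(starts)
# Toy contact matrix with one row and one column per bin.
matrix = np.zeros((n_bins, n_bins), dtype=np.int32)
```

Position i in each list describes row i and column i of the matrix, so all four lists must have exactly as many entries as the matrix has rows.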

apaytuvi commented 7 years ago

@fidelram Thank you. I store my matrix this way:

np.savez("HepG2_150000.npz",matrix=matrix, chrNameList=chromosomes, startList=starts, endList=ends, extraList=extra)

where matrix is:

array([[  34,   55,    0, ...,    0,    0,    0],
       [  55,  282,    0, ...,    0,    0,    0],
       [   0,    0,    0, ...,    0,    0,    0],
       ..., 
       [   0,    0,    0, ...,    0,    0,    0],
       [   0,    0,    0, ...,    0,    0,    0],
       [   0,    0,    0, ...,    0,    0, 2960]], dtype=int16)

chromosomes is:

array(['chr1', 'chr1', 'chr1', ..., 'chrY', 'chrY', 'chrM'], 
      dtype='|S5')

starts is:

array([       0,   150000,   300000, ..., 59100000, 59250000,        0])

ends is:

array([  150000,   300000,   450000, ..., 59250000, 59373566,    16571])

extra is:

array([    1,     2,     3, ..., 20650, 20651, 20652])

When I try to use this npz file with HiCExplorer, it always fails. I've already reported a problem with hicPlotMatrix, but hicCorrectMatrix fails as well:

hicCorrectMatrix diagnostic_plot -m HepG2_150000.npz -o diagnostic150000.png

Traceback (most recent call last):
  File "/usr/bin/hicCorrectMatrix", line 5, in <module>
    pkg_resources.run_script('HiCExplorer==1.3', 'hicCorrectMatrix')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 540, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1455, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.7/site-packages/HiCExplorer-1.3-py2.7.egg/EGG-INFO/scripts/hicCorrectMatrix", line 7, in <module>
    main()
  File "/usr/lib/python2.7/site-packages/HiCExplorer-1.3-py2.7.egg/hicexplorer/hicCorrectMatrix.py", line 598, in main
    plot_total_contact_dist(ma.matrix, args)
  File "/usr/lib/python2.7/site-packages/HiCExplorer-1.3-py2.7.egg/hicexplorer/hicCorrectMatrix.py", line 476, in plot_total_contact_dist
    hic_ma.data[np.isnan(hic_ma.data)] = 0
TypeError: bad argument type for built-in operation

fidelram commented 7 years ago

I think the problem is that you are not using a sparse matrix. Simply add:

from scipy.sparse import csr_matrix
matrix = csr_matrix(matrix)
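A plausible reading of the traceback is that the failing line indexes hic_ma.data, and that attribute means different things on dense and sparse matrices. The small sketch below uses a toy 3x3 matrix (not the real data) to show the difference. It was written for Python 3, whereas the traceback comes from Python 2.7, so the exact error may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[34, 55, 0],
                  [55, 282, 0],
                  [0, 0, 2960]], dtype=np.int16)
sparse = csr_matrix(dense)

# On a dense ndarray, .data is a raw buffer object, not an array of values.
dense_data_is_array = isinstance(dense.data, np.ndarray)

# On a CSR matrix, .data is a 1-D ndarray holding the stored (non-zero) entries.
sparse_data_is_array = isinstance(sparse.data, np.ndarray)
stored_values = sparse.data.tolist()
```

Code that masks or overwrites entries through .data therefore presumably expects the sparse form.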

Then you can save it. This is how I save the matrix:

np.savez(filename, matrix=matrix, chrNameList=chromosomes, startList=starts, endList=ends, extraList=extra, nan_bins=np.array([]), correction_factors=None)
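To check such a file before handing it to HiCExplorer, a save-and-reload round trip along these lines may help. It is a sketch on toy data: the file name and values are made up, and the allow_pickle=True argument is an assumption about recent NumPy versions, which refuse to unpickle object arrays by default. How HiCExplorer itself reads the file is not shown here.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Toy data standing in for a real contact matrix and its bins.
matrix = csr_matrix(np.array([[34, 55, 0],
                              [55, 282, 0],
                              [0, 0, 2960]]))
chromosomes = np.array(["chr1", "chr1", "chrM"])
starts = np.array([0, 150000, 0])
ends = np.array([150000, 300000, 16571])
extra = np.zeros(3)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_matrix.npz")
np.savez(path, matrix=matrix, chrNameList=chromosomes, startList=starts,
         endList=ends, extraList=extra, nan_bins=np.array([]),
         correction_factors=None)

# The sparse matrix is stored as a pickled 0-d object array, so it has to be
# unwrapped with .item() after loading.
with np.load(path, allow_pickle=True) as loaded:
    restored = loaded["matrix"].item()
    n_names = len(loaded["chrNameList"])
```

If restored has the same shape and values as the original matrix, and the bin lists have one entry per row, the file follows the layout described in this thread.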

apaytuvi commented 7 years ago

@fidelram Thank you, now it works!