deeptools / HiCMatrix

GNU General Public License v3.0

Using numpy.string_ for info dict generation prevents generated cooler files from being used with HiGlass #29

Closed dmalzl closed 4 years ago

dmalzl commented 4 years ago

I am using HiCExplorer for downstream analysis of our Hi-C data. To view the results I wanted to use HiGlass, which uses the multicooler format to display the data at different resolutions. HiCExplorer provides a nice utility for converting between the h5 and cooler formats. However, when I try to view the results in HiGlass, I get an import error. In brief, my script does the following:

hicConvertMatrix -m sample_100kb.h5 --inputFormat h5 --outputFormat cool -o sample_100kb.cool
cooler zoomify -r 500000,1000000 -o sample.mcool sample_100kb.cool

After investigating this issue, I found that the conversion as implemented by the hicmatrix library contains a bug. In particular, when importing a dataset, HiGlass seems to read the metadata of each cooler container in the multicooler as JSON. This works for the coarser resolutions generated with cooler zoomify, but the metadata of the cool file generated with hicConvertMatrix cannot be read because it contains binary objects. A quick check with cooler attrs gives:

'@attrs':
      bin-size: 100000
      bin-type: fixed
      creation-date: 2020-04-09 10:42:50.675971
      format: !!binary |
        SERGNTo6Q29vbGVy
      format-url: !!binary |
        aHR0cHM6Ly9naXRodWIuY29tL21pcm55bGFiL2Nvb2xlcg==
      format-version: 3
      generated-by: !!binary |
        SGlDTWF0cml4LTEx
      generated-by-cooler-lib: !!binary |
        Y29vbGVyLTAuOC41
      genome-assembly: unknown
      metadata: {}
      nbins: 24723
      nchroms: 19
      nnz: 141621591
      storage-mode: symmetric-upper
      sum: 21271216939042.75
      tool-url: !!binary |
        aHR0cHM6Ly9naXRodWIuY29tL2RlZXB0b29scy9IaUNNYXRyaXg= 

These binary values cannot be serialized when the JSON is generated, so the import into HiGlass fails.
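
The base64 payloads in the listing above decode to the expected plain strings, which suggests the values were stored as bytes rather than as text. The following is a minimal sketch using only the Python standard library; the helper name is_json_serializable is hypothetical, and whether HiGlass serializes the attributes in exactly this way is an assumption.

```python
import base64
import json

# Base64 payloads copied from the `cooler attrs` output above.
encoded = {
    "format": "SERGNTo6Q29vbGVy",
    "format-url": "aHR0cHM6Ly9naXRodWIuY29tL21pcm55bGFiL2Nvb2xlcg==",
    "generated-by": "SGlDTWF0cml4LTEx",
    "generated-by-cooler-lib": "Y29vbGVyLTAuOC41",
    "tool-url": "aHR0cHM6Ly9naXRodWIuY29tL2RlZXB0b29scy9IaUNNYXRyaXg=",
}

# Decoding yields bytes objects, which is how YAML's !!binary tag represents them.
raw = {key: base64.b64decode(value) for key, value in encoded.items()}
for key, value in raw.items():
    print(f"{key}: {value!r}")


def is_json_serializable(obj):
    """Hypothetical helper: report whether json.dumps accepts the object."""
    try:
        json.dumps(obj)
    except TypeError:
        return False
    return True


# bytes values are rejected by the json module; decoded text is accepted.
print(is_json_serializable(raw))
print(is_json_serializable({k: v.decode("utf-8") for k, v in raw.items()}))
```

The decoded values match the plain-text attributes that appear once the strings are written as str.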

A quick look at the cool.py file of the hicmatrix library reveals the source of the problem. On lines 364-397, the info dictionary of the new cooler file is generated, and string conversion is explicitly handled by numpy.string_. However, the HDF5 library seems unable to treat this datatype as text and stores it as a binary object. Replacing numpy.string_ with the native Python str function resolves the problem, and a quick check with cooler attrs then gives:

'@attrs':
  bin-size: 100000
  bin-type: fixed
  creation-date: 2020-04-09 12:57:38.542590
  format: HDF5::Cooler
  format-url: https://github.com/mirnylab/cooler
  format-version: 3
  generated-by: HiCMatrix-11
  generated-by-cooler-lib: cooler-0.8.5
  genome-assembly: unknown
  metadata: {}
  nbins: 24723
  nchroms: 19
  nnz: 141621591
  storage-mode: symmetric-upper
  sum: 21271216939042.75
  tool-url: https://github.com/deeptools/HiCMatrix

I therefore propose replacing numpy.string_ with str to ensure compatibility with HiGlass.
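
The effect of the proposed change can be sketched without touching the library itself. The example below is an illustration under stated assumptions, not the actual code in cool.py: the info dictionary is a hypothetical excerpt built from the attribute values shown above, and to_text is a hypothetical helper. Plain bytes literals stand in for numpy.string_ values, which are a subclass of bytes. Note that calling str() on a value that is already bytes would give its "b'...'" representation, so existing bytes need to be decoded instead.

```python
import json


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes-like attribute values, leave others as-is."""
    # isinstance(..., bytes) also matches numpy.string_ values, which subclass bytes.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical excerpt of an info dictionary holding bytes values.
info = {
    "format": b"HDF5::Cooler",
    "format-url": b"https://github.com/mirnylab/cooler",
    "format-version": 3,
    "generated-by": b"HiCMatrix-11",
}

# After normalization every value is a str or a number, so JSON export works.
clean_info = {key: to_text(value) for key, value in info.items()}
print(json.dumps(clean_info, indent=2))
```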