4dn-dcic / hic2cool

Lightweight converter between hic and cool contact matrices.
MIT License

cool required attributes not found in hic files converted to cool files #27

Closed koustav-pal closed 5 years ago

koustav-pal commented 5 years ago

Hi

I maintain a package that lets users import and read cool files. Users have reported that files produced by the hic2cool converter cannot be read. My investigation has led me to believe that the problem is in hic2cool. Details of the procedure follow.

I installed the hic2cool utility and tried to convert a .hic file to cool format.

I tried to use this file: https://data.4dnucleome.org/files-processed/4DNFIH3OTR14/

After converting it to cool format, I found that many of the attributes defined as required by the cooler schema (version 3, and also listed in version 2) are not written by the converter.

format; Length: 1; value: HDF5::Cooler
Error in ReturnH5Attribute(Handle = Brick.handler, name = An.attribute,  :
  Attribute format-versionnot found in HDF file.

format-version; Length: 1; value: Error in ReturnH5Attribute(Handle = Brick.handler, name = An.attribute,  :
  Attribute format-versionnot found in HDF file.

bin-type; Length: 1; value: fixed
bin-size; Length: 1; value: 50000
Error in ReturnH5Attribute(Handle = Brick.handler, name = An.attribute,  :
  Attribute storage-modenot found in HDF file.

storage-mode; Length: 1; value: Error in ReturnH5Attribute(Handle = Brick.handler, name = An.attribute,  :
  Attribute storage-modenot found in HDF file.

My package depends on the format-version attribute to distinguish between the v1, v2, and v3 formats. Without it, a cool file is not processed. Please provide a fix for this issue.
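
For illustration, the check that fails above can be sketched as follows. This is a minimal sketch, not the package's actual code: `REQUIRED_ATTRS` contains only the five attribute names printed in the log above, and `missing_required_attrs` is a hypothetical helper that accepts any mapping of attribute names to values (for example `dict(group.attrs)` from h5py).

```python
# Attribute names taken from the log above; the cooler schema remains the
# authority on the complete list of required attributes.
REQUIRED_ATTRS = ("format", "format-version", "bin-type", "bin-size", "storage-mode")


def missing_required_attrs(attrs):
    """Return the required attribute names that are absent from a mapping."""
    return [name for name in REQUIRED_ATTRS if name not in attrs]


# Attributes as reported for the converted file: two required entries are absent.
converted = {"format": "HDF5::Cooler", "bin-type": "fixed", "bin-size": 50000}
print(missing_required_attrs(converted))  # ['format-version', 'storage-mode']
```

Reporting every missing name at once, rather than failing on the first, makes it easier to see which attributes a converter skipped.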

carlvitzthum commented 5 years ago

Hi @koustav-pal

Thanks for identifying this issue. I have released hic2cool version 0.6.0, which adds the format-version and storage-mode metadata attributes to the files. Per the cooler data layout specification, these are added to the top-level cooler info for single-resolution files and to each individual resolution in multi-res files.

If needed, you can update the attributes of files generated by previous versions of hic2cool using the command line or programmatically:

from hic2cool import hic2cool_update

cooler_file = 'my_file.cool'          # placeholder: path to an existing cooler file
target_file = 'my_file.updated.cool'  # placeholder: path for the updated copy
# will update the input cooler file directly. silent=True disables command line confirmation
hic2cool_update(cooler_file, silent=True)
# OR, leave the input file unchanged and write to a new one
hic2cool_update(cooler_file, target_file, silent=True)

Please let me know if this fix works for you.

koustav-pal commented 5 years ago

Hi @carlvitzthum

My tests show that the issue has been fixed. Thanks a lot!

koustav-pal commented 5 years ago

Hi @carlvitzthum,

My apologies, I had not correctly interpreted your fix. Going over the cooler specification once again (v3 and v2), it states that the metadata is present at the root level, which I interpret as being /.

Users are still having trouble importing the mcool files. They reported that for single-resolution mcool files it is working fine, whereas for multi-resolution it is not.

Therefore, I downloaded an mcool file, specifically this one from the 4DN data portal. This is a multi-resolution mcool file, and the file import for mcools in my package is working fine.

If I am not mistaken, currently the format-version metadata is being created at the /resolutions path. But as per my understanding of the schema, it should be present at the / path.

Can you please move the required metadata information to the root level (/) HDF5 attributes as in the existing mcool files present in the 4DN data portal?

carlvitzthum commented 5 years ago

Hi @koustav-pal

No problem. It could be that I misunderstood the schema specification when I originally set up the attributes. What you said is correct: for single-resolution coolers I put all attributes at the root level and on multi-res I put the attributes separately for each resolution (the same place where data collections are located). This is the structure outlined here in the docs, but that does seem inconsistent with the first statement here.

From the mcool file you linked, I find some attributes on the root level. As you point out, I have not been populating these for multi-resolution coolers.

>>> import h5py
>>> h5 = h5py.File('4DNFI2Y6GTWP.mcool', 'r')
>>> dict(h5.attrs)
{'format': 'HDF5::MCOOL', 'format-version': 2}

But the majority of the attributes are on the individual resolutions in the same file. As an aside, the format-version in a resolution's attributes is actually different from the one at the root level.

>>> dict(h5['resolutions']['10000'].attrs)
{'bin-size': 10000, 'bin-type': 'fixed', 'creation-date': '2019-04-04T08:00:05.726124', 'format': 'HDF5::Cooler', 'format-url': 'https://github.com/mirnylab/cooler', 'format-version': '3', 'generated-by': 'cooler-0.8.3', 'genome-assembly': 'unknown', 'metadata': '{}', 'nbins': 308837, 'nchroms': 24, 'nnz': 178478386, 'storage-mode': 'symmetric-upper', 'sum': 292443650}
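
The two outputs above also differ in type: format-version is the integer 2 at the root and the string '3' on the resolution. A reader that branches on the version may want to normalize it first. The following is a minimal sketch under that assumption; `layout_and_version` is a hypothetical helper, not part of hic2cool or cooler, and it accepts any mapping of attribute names to values.

```python
def layout_and_version(attrs):
    """Return (format, format-version as int) from an attribute mapping."""
    fmt = attrs["format"]
    if isinstance(fmt, bytes):  # some HDF5 readers return byte strings
        fmt = fmt.decode("utf-8")
    # int() accepts ints as well as numeric strings such as '3'
    return fmt, int(attrs["format-version"])


# Values copied from the two attribute dumps above.
root_attrs = {"format": "HDF5::MCOOL", "format-version": 2}
resolution_attrs = {"format": "HDF5::Cooler", "format-version": "3"}
print(layout_and_version(root_attrs))        # ('HDF5::MCOOL', 2)
print(layout_and_version(resolution_attrs))  # ('HDF5::Cooler', 3)
```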

I am happy to add some attributes to the root level for multi-res files, but I want to make sure they're the right ones and that this process is standardized. @nvictus could you weigh in on this?

nvictus commented 5 years ago

Hi Koustav,

> the metadata information is present at the root-level, which I interpret as being /.

Actually, the metadata is meant to be the root level of every cooler data collection. That means, if a cooler lives at /resolutions/10000, then the metadata for that resolution is attached directly to that group. Naturally, it would be impossible to attach metadata to every resolution at the very root of the file, because HDF5 attributes do not support nesting.

Carl is populating top-level / metadata for the MCOOL layout itself, which is in its second version (the first was the numbered layout: /0, /1, ...). So this is all correct. I can try to clarify this and the meaning of "root-level" in the docs.

> If I am not mistaken, currently the format-version metadata is being created at the /resolutions path.

That shouldn't be the case. Each data collection (resolution) in the file should have metadata at its root, which should be resolutions/{xxxx}. You can quickly double-check this with the cooler attrs command.
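
The layout described above can be sketched as a small recursive walk that reports every group carrying 'format': 'HDF5::Cooler'. This is an illustrative sketch only, not cooler's own implementation: `find_cooler_roots` and `FakeGroup` are hypothetical names, `FakeGroup` is a stand-in for an h5py group so the example runs without an HDF5 file, and the two resolutions in the fake file are made up. The walk assumes that cooler collections are not nested inside each other.

```python
class FakeGroup(dict):
    """Stand-in for an h5py group: a dict of children plus an `attrs` mapping."""

    def __init__(self, attrs=None, children=None):
        super().__init__(children or {})
        self.attrs = attrs or {}


def find_cooler_roots(group, path="/"):
    """Yield (path, attrs) for each group whose 'format' is 'HDF5::Cooler'."""
    if group.attrs.get("format") == "HDF5::Cooler":
        yield path, dict(group.attrs)
        return  # assumption: a collection's children are its tables, not more coolers
    for name, child in group.items():
        if hasattr(child, "items"):  # descend into groups only, skip datasets
            yield from find_cooler_roots(child, path.rstrip("/") + "/" + name)


# Multi-resolution layout: MCOOL metadata at /, cooler metadata per resolution.
mcool = FakeGroup(
    attrs={"format": "HDF5::MCOOL", "format-version": 2},
    children={"resolutions": FakeGroup(children={
        "10000": FakeGroup(attrs={"format": "HDF5::Cooler", "format-version": "3"}),
        "50000": FakeGroup(attrs={"format": "HDF5::Cooler", "format-version": "3"}),
    })},
)
# Single-resolution layout: the file root is itself the cooler collection.
cool = FakeGroup(attrs={"format": "HDF5::Cooler", "format-version": "3"})

print([p for p, _ in find_cooler_roots(mcool)])  # ['/resolutions/10000', '/resolutions/50000']
print([p for p, _ in find_cooler_roots(cool)])   # ['/']
```

With a real file, an open `h5py.File` object could be passed in place of `FakeGroup`, since it exposes the same `attrs` and `items()` interface.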

koustav-pal commented 5 years ago

Hi Nezar,

So, in this case is the presence of the format-version at the / level of the mcool files a bug?

I guess for cool files the existence of the metadata at the / level is to be expected, correct?

nvictus commented 5 years ago

Hi Koustav,

Like I said, format-version at the / level refers to the MCOOL layout ('format': 'HDF5::MCOOL'), and not the COOL format. So it isn't a bug, but I apologize for the confusion.

> cool files the existence of the metadata at the / level is to be expected, correct?

Yes, for a single-resolution .cool file, the root of the data collection and the root of the file are identical.

nvictus commented 5 years ago

Note that there are a few file introspection functions to determine if a file is single-res or multi-res (i.e. mcool), and a list_coolers function to list all individual cooler data collections inside a file regardless of the file's layout: https://cooler.readthedocs.io/en/latest/api.html#cooler-fileops

carlvitzthum commented 5 years ago

@koustav-pal I will go ahead and add {'format': 'HDF5::MCOOL', 'format-version': 2} to the / attributes for mcool files generated by hic2cool. Even if it's not strictly necessary, I think it is helpful. I will release shortly and let you know.

carlvitzthum commented 5 years ago

hic2cool 0.7.1 is released, which now adds the aforementioned attributes to the / level. I have also added an update function so files created by previous versions of hic2cool can be brought up-to-date using hic2cool update.

Please let me know if there are further issues.

koustav-pal commented 5 years ago

Thanks a lot for the help @carlvitzthum.

I think this issue is resolved.