Closed koustav-pal closed 5 years ago
Hi @koustav-pal
Thanks for identifying this issue. I have released hic2cool version 0.6.0, which adds format-version and storage-type metadata attributes to the files. Per the cooler data layout specifications, these are added to the top-level cooler info for single-resolution files and to each individual resolution in multi-res files.
If needed, you can update the attributes of files generated by previous versions of hic2cool using the command line or programmatically:
from hic2cool import hic2cool_update

# Update the input cooler file in place. silent=True disables the command-line confirmation.
hic2cool_update('<cooler file>', silent=True)
# OR, leave the input file unchanged and write to a new one
hic2cool_update('<cooler file>', '<target cooler file>', silent=True)
Please let me know if this fix works for you.
Hi @carlvitzthum
My tests show that the issue has been fixed. Thanks a lot!
Hi @carlvitzthum,
My apologies, I had not correctly interpreted your fix. Going over the cooler specification once again (v3 and v2), it mentions that the metadata information is present at the root level, which I interpret as being /.
Users are still having trouble importing the mcool files. They reported that for single-resolution mcool files it is working fine, whereas for multi-resolution it is not.
Therefore, I downloaded an mcool file, specifically this one from the 4DN data portal. This is a multi-resolution mcool file, and the file import for mcools in my package is working fine.
If I am not mistaken, the format-version metadata is currently being created at the /resolutions path. But as per my understanding of the schema, it should be present at the / path.
Can you please move the required metadata information to the root-level (/) HDF5 attributes, as in the existing mcool files present in the 4DN data portal?
Hi @koustav-pal
No problem. It could be that I misunderstood the schema specification when I originally set up the attributes. What you said is correct: for single-resolution coolers I put all attributes at the root level and on multi-res I put the attributes separately for each resolution (the same place where data collections are located). This is the structure outlined here in the docs, but that does seem inconsistent with the first statement here.
From the mcool file you linked, I find some attributes on the root level. As you point out, I have not been populating these for multi-resolution coolers.
>>> import h5py
>>> h5 = h5py.File('4DNFI2Y6GTWP.mcool', 'r')
>>> dict(h5.attrs)
{'format': 'HDF5::MCOOL', 'format-version': 2}
But the majority of the attributes are on the individual resolutions when looking at the same file. As an aside, the format-version on a resolution's attributes is actually different from the one at the root level.
>>> dict(h5['resolutions']['10000'].attrs)
{'bin-size': 10000, 'bin-type': 'fixed', 'creation-date': '2019-04-04T08:00:05.726124', 'format': 'HDF5::Cooler', 'format-url': 'https://github.com/mirnylab/cooler', 'format-version': '3', 'generated-by': 'cooler-0.8.3', 'genome-assembly': 'unknown', 'metadata': '{}', 'nbins': 308837, 'nchroms': 24, 'nnz': 178478386, 'storage-mode': 'symmetric-upper', 'sum': 292443650}
I am happy to add some attributes to the root level for multi-res files, but I want to make sure they're the right ones and that this process is standardized. @nvictus could you weigh in on this?
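The two dumps above store format-version with different types: the integer 2 at the root and the string '3' on the resolution. A reader that branches on this attribute may want to normalize it first. The helper below is only a minimal sketch of that idea; parse_format_version is a hypothetical name, not part of hic2cool or cooler, and it assumes the attributes have been copied into a plain mapping such as dict(group.attrs).

```python
def parse_format_version(attrs):
    """Return the 'format-version' attribute as an int, or None if it is absent.

    `attrs` is any mapping of attribute names to values, e.g. dict(group.attrs).
    (Hypothetical helper for illustration; not part of hic2cool or cooler.)
    """
    value = attrs.get('format-version')
    if value is None:
        return None
    if isinstance(value, bytes):
        # Assumption: some writers may store string attributes as bytes.
        value = value.decode('ascii')
    return int(value)


# Attribute values copied from the dumps in this thread.
root_attrs = {'format': 'HDF5::MCOOL', 'format-version': 2}
res_attrs = {'format': 'HDF5::Cooler', 'format-version': '3'}

print(parse_format_version(root_attrs))
print(parse_format_version(res_attrs))
print(parse_format_version({}))
```

Returning None for a missing attribute lets the caller decide whether to reject the file or fall back to another check.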
Hi Koustav,
> the metadata information is present at the root-level, which I interpret as being /.
Actually, the metadata is meant to be at the root level of every cooler data collection. That means that if a cooler lives at /resolutions/10000, then the metadata for that resolution is attached directly to that group. Naturally, it would be impossible to attach metadata for every resolution at the very root of the file, because HDF5 attributes do not support nesting.
Carl is populating top-level / metadata for the MCOOL layout itself, which is in its second version (the first was the numbered layout: /0, /1, ...). So this is all correct. I can try to clarify this and the meaning of "root-level" in the docs.
> If I am not mistaken, currently the format-version metadata is being created at the /resolutions path.
That shouldn't be the case. Each data collection (resolution) in the file should have metadata at its root, which should be resolutions/{xxxx}. You can quickly double-check this with the cooler attrs command.
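To make the layout described above concrete, here is a rough sketch of how a reader could locate the root of each cooler data collection by looking at the format attribute. It is not how cooler itself does it; find_cooler_roots and FakeGroup are hypothetical names. FakeGroup only imitates the parts of an h5py group used here (item access, membership tests, .items() and .attrs) so that the example runs without an HDF5 file.

```python
class FakeGroup(dict):
    """Hypothetical stand-in for an h5py group: children are items, attributes are in .attrs."""

    def __init__(self, attrs=None, children=None):
        super().__init__(children or {})
        self.attrs = dict(attrs or {})


def find_cooler_roots(h5):
    """List the group paths whose attributes mark them as a cooler data collection."""
    if h5.attrs.get('format') == 'HDF5::Cooler':
        # Single-resolution .cool: the file root is the collection root.
        return ['/']
    roots = []
    if 'resolutions' in h5:
        # Multi-resolution layout: one collection per /resolutions/<bin size> group.
        for name, group in h5['resolutions'].items():
            if group.attrs.get('format') == 'HDF5::Cooler':
                roots.append('/resolutions/' + name)
    return roots


cooler_attrs = {'format': 'HDF5::Cooler', 'format-version': '3'}
mcool = FakeGroup(
    attrs={'format': 'HDF5::MCOOL', 'format-version': 2},
    children={'resolutions': FakeGroup(children={'10000': FakeGroup(attrs=cooler_attrs)})},
)
cool = FakeGroup(attrs=cooler_attrs)

print(find_cooler_roots(mcool))
print(find_cooler_roots(cool))
```

With a real file, the same function should accept an open h5py.File in place of FakeGroup, assuming string attributes are returned as str.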
Hi Nezar,
So, in this case, is the presence of format-version at the / level of the mcool files a bug?
I guess for cool files the existence of the metadata at the / level is to be expected, correct?
Hi Koustav,
Like I said, format-version at the / level refers to the MCOOL layout ('format': 'HDF5::MCOOL'), and not the COOL format. So it isn't a bug, but I apologize for the confusion.
> cool files the existence of the metadata at the / level is to be expected, correct?
Yes, for a single-resolution .cool file, the root of the data collection and the root of the file are identical.
Note that there are a few file introspection functions to determine whether a file is single-res or multi-res (i.e. mcool), and a list_coolers function to list all individual cooler data collections inside a file regardless of the file's layout: https://cooler.readthedocs.io/en/latest/api.html#cooler-fileops
@koustav-pal I will go ahead and add {'format': 'HDF5::MCOOL', 'format-version': 2} to the / attributes for mcool files generated by hic2cool. Even if it's not strictly necessary, I think it is helpful. I will release shortly and let you know.
hic2cool 0.7.1 is released, which now adds the aforementioned attributes to the / level. I have also added an update function so files created by previous versions of hic2cool can be brought up to date using hic2cool update.
Please let me know if there are further issues.
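For readers curious what bringing an older file up to date amounts to at the attribute level, the following is a minimal sketch, not hic2cool's actual implementation. ensure_mcool_root_attrs is a hypothetical helper that works on any mutable mapping; passing the .attrs of an h5py file opened for writing is an assumption that is not tested here. For real files, hic2cool update remains the supported route.

```python
# Root-level attributes quoted earlier in this thread for the MCOOL layout.
MCOOL_ROOT_ATTRS = {'format': 'HDF5::MCOOL', 'format-version': 2}


def ensure_mcool_root_attrs(attrs):
    """Add any missing MCOOL layout attributes to `attrs` and return the names added.

    Existing values are left untouched. (Hypothetical helper for illustration.)
    """
    added = []
    for name, value in MCOOL_ROOT_ATTRS.items():
        if name not in attrs:
            attrs[name] = value
            added.append(name)
    return added


old_root = {}  # root attributes of an mcool written before the fix
print(ensure_mcool_root_attrs(old_root))
print(old_root)
print(ensure_mcool_root_attrs(old_root))  # a second call has nothing left to add
```

Only filling in missing keys keeps the operation safe to repeat on files that already carry the attributes.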
Thanks a lot for the help @carlvitzthum.
I think this issue is resolved.
Hi
I have a package which allows users to import and read cool files. However, users have reported that files produced by the hic2cool converter cannot be read. Investigating the issue has led me to believe that the problem is in hic2cool. Read on for details of the procedure.
I installed the hic2cool utility and tried to convert a .hic file to cool format.
I tried to use this file: https://data.4dnucleome.org/files-processed/4DNFIH3OTR14/
After converting it to cool format, I found that many of the attributes defined as required in the cooler schema (version 3, and also listed in version 2) are not written by the converter.
My package depends on the format-version attribute to distinguish between the v1, v2, and v3 formats. Without it, a cool file is not processed. Please provide a bugfix for this issue.