HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Cannot access compound dataset which contains array of enum #39

Open tmick0 opened 6 years ago

tmick0 commented 6 years ago

I am trying to access a dataset which contains an enum array via h5serv, but h5pyd raises the following exception:

  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/group.py", line 335, in __getitem__
    tgt = getObjByUuid(link_json['collection'], link_json['id'])
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/group.py", line 311, in getObjByUuid
    tgt = Dataset(DatasetID(self, dataset_json))
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/dataset.py", line 416, in __init__
    self._dtype = createDataType(self.id.type_json)
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 725, in createDataType
    dt = createDataType(field['type'])  # recursive call
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 732, in createDataType
    dtRet = createBaseDataType(typeItem)  # create non-compound dt
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 638, in createBaseDataType
    raise TypeError("Array Type base type must be integer, float, or string")
TypeError: Array Type base type must be integer, float, or string

We can create a minimal dataset to reproduce the error using h5py as follows:

import h5py
import numpy as np

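# Create a file whose compound type holds a fixed-size array of enums.
f = h5py.File('test.h5', 'w')
# Enum over a C int with three named values.
enum_type = h5py.special_dtype(enum=('i', {"FOO": 0, "BAR": 1, "BAZ": 2}))
# Compound type: 10-element enum array, an int, and a 32-character string.
comp_type = np.dtype([('my_enum_array', enum_type, 10), ('my_int', 'i'), ('my_string', np.str_, 32)])
dataset = f.create_dataset("test", (4,), comp_type)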
f.close()

We then put it in h5serv's data directory and try to access it:

import h5pyd
f = h5pyd.File("test.hdfgroup.org", endpoint="http://127.0.0.1:5000")
print(f['test'])

This yields the above exception. Note that we are able to access the dataset as expected using regular h5py.

Applying the following patch to h5pyd prevents the exception and returns a dataset; however, it doesn't seem to give the correct behavior (the enum array is treated as a plain int array):

diff --git a/h5pyd/_hl/h5type.py b/h5pyd/_hl/h5type.py
index 4ce6cb4..10ce562 100644
--- a/h5pyd/_hl/h5type.py
+++ b/h5pyd/_hl/h5type.py
@@ -637 +637 @@ def createBaseDataType(typeItem):
-            if arrayBaseType["class"] not in ('H5T_INTEGER', 'H5T_FLOAT', 'H5T_STRING'):
+            if arrayBaseType["class"] not in ('H5T_INTEGER', 'H5T_FLOAT', 'H5T_STRING', 'H5T_ENUM'):
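
For illustration only, here is a minimal sketch of what carrying the enum mapping through an array type could look like on the NumPy side, rather than just letting H5T_ENUM pass the class check. It assumes the server describes the type with HDF5/JSON-style keys ("class", "dims", "base", "mapping"); the helper name enum_array_dtype and the small base-type table are hypothetical and are not h5pyd's actual code. It relies on h5py's convention of storing an enum's name-to-value mapping in the dtype metadata under the 'enum' key.

```python
import numpy as np

# Hypothetical lookup table; a real client would cover many more base types.
_INT_BASES = {
    "H5T_STD_I8LE": "<i1",
    "H5T_STD_I16LE": "<i2",
    "H5T_STD_I32LE": "<i4",
    "H5T_STD_I64LE": "<i8",
}


def enum_array_dtype(type_json):
    """Build a NumPy subarray dtype from an array-of-enum type description."""
    if type_json["class"] != "H5T_ARRAY":
        raise TypeError("expected an H5T_ARRAY type")
    enum_json = type_json["base"]
    if enum_json["class"] != "H5T_ENUM":
        raise TypeError("expected an H5T_ENUM base type")
    int_dt = np.dtype(_INT_BASES[enum_json["base"]["base"]])
    # Attach the mapping as metadata, the same way h5py tags enum dtypes.
    enum_dt = np.dtype(int_dt, metadata={"enum": dict(enum_json["mapping"])})
    # Wrap the enum dtype in a fixed-shape subarray dtype.
    return np.dtype((enum_dt, tuple(type_json["dims"])))


# Assumed shape of the type description for the 'my_enum_array' field.
example = {
    "class": "H5T_ARRAY",
    "dims": [10],
    "base": {
        "class": "H5T_ENUM",
        "base": {"class": "H5T_INTEGER", "base": "H5T_STD_I32LE"},
        "mapping": {"FOO": 0, "BAR": 1, "BAZ": 2},
    },
}
dt = enum_array_dtype(example)
print(dt.shape, dt.base.metadata["enum"])
```

The point of the sketch is that the mapping has to be attached to the element dtype before it is wrapped in the array shape; allowing the class through the check alone leaves nothing to distinguish the field from an int array.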

I'm not sure how to properly proceed in working around this. Thanks in advance for your advice.

jreadey commented 6 years ago

Hi, it looks like the test coverage for enum types is pretty thin - we'll want to beef this up.

I'm a bit confused by what I see when using plain h5py with your HDF5 file.
If I do this:

f = h5py.File("test.h5", 'r')
dset = f['test']
print(dset.dtype)
dt = dset.dtype["my_enum_array"]
print("enum dt: {}".format(dt))
print(h5py.check_dtype(enum=dt))

I'm getting "None" for the last output line. Is this what you see?

tmick0 commented 6 years ago

Yes, it seems that the metadata is lost if we access it that way. However, if I write f['test']['my_enum_array'].dtype.metadata (or equivalently, h5py.check_dtype(enum=f['test']['my_enum_array'].dtype)), the enum dictionary is retrieved as expected. This is pretty confusing behavior indeed.
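
A likely explanation, offered as a sketch rather than a confirmed diagnosis: NumPy represents a fixed-size array field as a subarray dtype, and the enum metadata lives on that dtype's base element type, not on the subarray wrapper that dset.dtype["my_enum_array"] returns. The example below uses plain NumPy to mimic the compound type from the reproduction script. The helper name enum_mapping is hypothetical, and the 'enum' metadata key is assumed to match what h5py's special_dtype writes.

```python
import numpy as np


def enum_mapping(dt):
    """Return the enum mapping of dt, looking through a subarray dtype if needed."""
    # For a subarray dtype, .base is the element dtype; otherwise it is dt itself.
    meta = dt.base.metadata
    return None if meta is None else meta.get("enum")


# Stand-in for h5py.special_dtype(enum=('i', {...})): an int dtype tagged with metadata.
enum_dt = np.dtype("i4", metadata={"enum": {"FOO": 0, "BAR": 1, "BAZ": 2}})
comp = np.dtype([("my_enum_array", enum_dt, 10), ("my_int", "i4")])

field = comp["my_enum_array"]   # subarray dtype with shape (10,)
print(field.metadata)           # metadata on the subarray wrapper itself
print(enum_mapping(field))      # metadata found on the base element dtype
```

If this holds for the real file, h5py.check_dtype(enum=dset.dtype["my_enum_array"].base) should return the mapping, which would be consistent with the field read f['test']['my_enum_array'] exposing it.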