frejanordsiek / hdf5storage

Python package to read and write a wide range of Python types to/from HDF5 formatted files. Can read/write data to the HDF5 based Matlab v7.3 MAT files.
BSD 2-Clause "Simplified" License

Error using 'loadmat' with h5py 3.0 #102

Closed: Blubbaa closed this issue 3 years ago

Blubbaa commented 4 years ago

I recently upgraded to h5py 3.0.0 because I need some of its new features. As #101 also pointed out, hdf5storage is currently broken with 3.0.0. For me, however, using the master branch with v2.0 does not fix it. I am adding an example here because I frequently load v7.3 .mat files from Matlab.

The following code produces a ValueError, which is actually hidden if you supply a list of variable_names. After some debugging and reading the 3.0 change list, I still don't really understand what exactly is going wrong. It seems to fail while reading an attribute named 'MATLAB_fields' from the file.

Example

import h5py
import hdf5storage

print("h5py version: ", h5py.__version__)
print("hdf5storage version: ", hdf5storage.__version__)

file_name = r'test_file.mat'
file_dict = hdf5storage.loadmat(file_name, appendmat=False, variable_names=None)

Output

h5py version:  3.0.0
hdf5storage version:  0.2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-4a4156bc29cd> in <module>
      7 
      8 file_name = r'./data/test_file.mat'
----> 9 file_dict = hdf5storage.loadmat(file_name, appendmat=False, variable_names=None)
     10 
     11 

c:\users\jonas\documents\phd\venv\lib\site-packages\hdf5storage\__init__.py in loadmat(file_name, mdict, appendmat, variable_names, marshaller_collection, **keywords)
   2557         with File(filename, writable=False, options=options) as f:
   2558             if variable_names is None:
-> 2559                 data = {pathesc.unescape_path(k): v for k, v in f.items()}
   2560             else:
   2561                 data = dict()

c:\users\jonas\documents\phd\venv\lib\site-packages\hdf5storage\__init__.py in <dictcomp>(.0)
   2557         with File(filename, writable=False, options=options) as f:
   2558             if variable_names is None:
-> 2559                 data = {pathesc.unescape_path(k): v for k, v in f.items()}
   2560             else:
   2561                 data = dict()

C:\Program Files\Python38\lib\_collections_abc.py in __iter__(self)
    742     def __iter__(self):
    743         for key in self._mapping:
--> 744             yield (key, self._mapping[key])
    745 
    746 ItemsView.register(dict_items)

c:\users\jonas\documents\phd\venv\lib\site-packages\hdf5storage\__init__.py in __getitem__(self, path)

-> 2053         return self.reads((path, ))[0]
   2054 
   2055     def __setitem__(self, path, data):

c:\users\jonas\documents\phd\venv\lib\site-packages\hdf5storage\__init__.py in reads(self, paths)
   1919                         + groupname + '.')
   1920                 # Hand off everything to the low level reader.
-> 1921                 datas.append(utilities.read_data(self._file,
   1922                                                  self._file[groupname],
   1923                                                  targetname,

c:\users\jonas\documents\phd\venv\lib\site-packages\hdf5storage\utilities.py in read_data(f, grp, name, options, dsetgrp)
    210     # Get all attributes with values.
    211     defaultfactory = type(None)
--> 212     attributes = collections.defaultdict(defaultfactory,
    213                                          dsetgrp.attrs.items())
    214 

c:\users\jonas\documents\phd\venv\lib\site-packages\h5py\_hl\base.py in __iter__(self)
    430         with phil:
    431             for key in self._mapping:
--> 432                 yield (key, self._mapping.get(key))
    433 
    434 

C:\Program Files\Python38\lib\_collections_abc.py in get(self, key, default)
    658         'D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.'
    659         try:
--> 660             return self[key]
    661         except KeyError:
    662             return default

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

c:\users\jonas\documents\phd\venv\lib\site-packages\h5py\_hl\attrs.py in __getitem__(self, name)
     75 
     76         arr = numpy.ndarray(shape, dtype=dtype, order='C')
---> 77         attr.read(arr, mtype=htype)
     78 
     79         string_info = h5t.check_string_dtype(dtype)

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5a.pyx in h5py.h5a.AttrID.read()

h5py\_proxy.pyx in h5py._proxy.attr_rw()

h5py\_conv.pyx in h5py._conv.vlen2ndarray()

h5py\_conv.pyx in h5py._conv.conv_vlen2ndarray()

ValueError: data type must provide an itemsize
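
The traceback above fails inside `dsetgrp.attrs.items()`, so one way to find out which attribute is responsible is to read the attributes one at a time and record the ones that raise. The following is only a minimal diagnostic sketch: `unreadable_attrs` and `scan_file` are hypothetical helper names, not part of hdf5storage or h5py, and it assumes that iterating over attribute names still works while reading a single value fails.

```python
def unreadable_attrs(attrs):
    """Return [(name, exception)] for attributes that cannot be read.

    `attrs` is any mapping-like object, for example the `.attrs` of an
    h5py File, Group, or Dataset. Each attribute is read on its own so
    that one failure does not abort the whole scan.
    """
    failures = []
    for name in attrs:
        try:
            attrs[name]
        except Exception as exc:  # a ValueError in the traceback above
            failures.append((name, exc))
    return failures


def scan_file(path):
    """Scan every object in an HDF5 file; returns {object_path: failures}."""
    import h5py  # imported here so unreadable_attrs() is usable without h5py

    report = {}
    with h5py.File(path, "r") as f:
        # visititems() does not visit the root group, so check it first.
        root_failures = unreadable_attrs(f.attrs)
        if root_failures:
            report["/"] = root_failures

        def visit(name, obj):
            failures = unreadable_attrs(obj.attrs)
            if failures:
                report["/" + name] = failures

        f.visititems(visit)
    return report
```

Calling `scan_file('test_file.mat')` on the file from the example should then list the objects whose attributes cannot be read, without stopping at the first failure.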

kb- commented 3 years ago

savemat is also broken with h5py 3.x. The following code stopped working: hdf5storage.savemat(file, data, format='7.3', oned_as='row', store_python_metadata=True, matlab_compatible=True)

Reverting to h5py 2.10.0 lets it work, with the following warning:

\lib\site-packages\hdf5storage\__init__.py:1234: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
  f = h5py.File(filename)
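
Because the `h5py.File(filename)` call sits inside hdf5storage, a caller cannot pass the mode directly. One user-side option is to filter the warning by its message with the standard `warnings` module. This is a sketch only: `suppress_default_file_mode_warning` is a hypothetical helper name, and the filter matches on the message text quoted above rather than on the warning class.

```python
import warnings


def suppress_default_file_mode_warning():
    """Ignore h5py 2.10's 'default file mode will change' warning.

    The `message` argument is a regular expression matched against the
    start of the warning text, so other warnings are left untouched.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"The default file mode will change",
    )
```

Call it once before `hdf5storage.savemat(...)`. The warning also mentions the environment variable H5PY_DEFAULT_READONLY=1, but that makes read-only the default mode, which is probably not what a call that writes a file needs.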

frejanordsiek commented 3 years ago

Sorry I have taken so long to get around to this.

The problem appears to be either a backwards-incompatible change in h5py or a bug. Specifically, it comes up when reading the 'MATLAB_fields' Attribute, which has a quite unusual type. It can be written, but it can no longer be read in any way, except probably through h5py's low level API, which is no longer documented.

The bug shows up if one does the following to make an Attribute with the same type:

>>> import numpy, h5py
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((1, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>> f.attrs['test']

The output from h5dump data.h5 is

HDF5 "data.h5" {
GROUP "/" {
   ATTRIBUTE "test" {
      DATATYPE  H5T_VLEN { H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }}
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): ("a", "b")
      }
   }
}
}

I am going to bring this up with h5py and see what can be done about it, including whether there is a good workaround using the low level API (the more raw libhdf5 bindings).

sethrj commented 3 years ago

See https://github.com/h5py/h5py/issues/1817

frejanordsiek commented 3 years ago

Workarounds added in commit 3008efs for the main branch and commit a63128b for the 0.1.x branch. The package should now work for h5py 3.0 and 3.1. I will be uploading version 0.1.16 to PyPI shortly.

frejanordsiek commented 3 years ago

Fixed for 32-bit little endian systems in commit 9f021ee for the 0.1.x branch and commit c8a306e for the main branch. I still don't know if it works on big-endian systems.

frejanordsiek commented 3 years ago

There was a bug in the commits fixing the issues on 32-bit systems. Recent commits fix that.

frejanordsiek commented 3 years ago

Just released version 0.1.17 on PyPI which includes the fix.
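
For anyone checking an existing environment, the version numbers in this thread can be turned into a rough check. The sketch below is an assumption-laden convenience, not an official compatibility table: `version_tuple` and `likely_affected` are hypothetical names, the rule only covers released 0.1.x versions of hdf5storage (anything below 0.1.17 is treated conservatively as affected when h5py is 3.0 or newer), and development builds reporting 0.2 are not covered.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse up to three leading numeric components: '3.0.0' -> (3, 0, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def likely_affected(h5py_version, hdf5storage_version):
    """True if h5py is 3.x or newer and hdf5storage is older than 0.1.17.

    Assumption: only meaningful for the released 0.1.x line.
    """
    return (version_tuple(h5py_version) >= (3,)
            and version_tuple(hdf5storage_version) < (0, 1, 17))


if __name__ == "__main__":
    try:
        affected = likely_affected(metadata.version("h5py"),
                                   metadata.version("hdf5storage"))
        print("upgrade hdf5storage" if affected else "combination looks fine")
    except metadata.PackageNotFoundError as exc:
        print("package not installed:", exc)
```

If the check reports an affected combination, upgrading hdf5storage to 0.1.17 or later is the fix described in this thread.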