NCAS-CMS / pyfive

A pure Python HDF5 file reader
BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

h5py Datatypes #8

Closed bnlawrence closed 4 months ago

bnlawrence commented 4 months ago

H5py has a class Datatype. This is part of uber ticket #7!

Unfortunately, when creating NetCDF groups, h5netcdf does it's own iteration over an HDF5 group, and looks for instances of DataTypes:

for k, v in self._h5group.items():
            if isinstance(v, self._root._h5py.Group):
                # add to the groups collection if this is a h5py(d) Group
                # instance
                self._groups.add(k)
            # todo: add other user types here
            elif isinstance(
                v, self._root._h5py.Datatype
            ) and self._root._h5py.check_enum_dtype(v.dtype):
                self._enumtypes.add(k)
            else:
                if v.attrs.get("CLASS") == b"DIMENSION_SCALE":
                    # add dimension and retrieve size
                    self._dimensions.add(k)
                else:
                    if self._root._phony_dims_mode is not None:
                        # check if malformed variable and raise
                        if _unlabeled_dimension_mix(v) == "unlabeled":
                            # if unscaled variable, get phony dimensions
                            phony_dims |= Counter(v.shape)

                if not _netcdf_dimension_but_not_variable(v):
                    if isinstance(v, self._root._h5py.Dataset):
                        self._variables.add(k)

In this self._root.h5py is in fact, pyfive after my backend substitution. That means we need to have pyfive support for Datatype and check_enum_dtype or I need another strategy.

In pyfive, I think the Datatype corresponds to DatatypeMessage. IF it's that simple, there are a couple of routes to solving this, we modify pyfive to look more like h5py, or we create a new Datatype class which subclasses DatatypeMessage. I suspect I will go that route ...

The H5py datatype is defined here: https://github.com/h5py/h5py/blob/master/h5py/_hl/datatype.py

bnlawrence commented 4 months ago

Yeah, well, I got this wrong:

The relevant loop inside pyfive h5netcdf branch as it currently stands is this one:

dataobjs = DataObjects(self.file._fh, link_target)
if dataobjs.is_dataset:
       if additional_obj != '.':
              raise KeyError('%s is a dataset, not a group' % (obj_name))
       return Dataset(obj_name, DatasetDataObject(self.file._fh, link_target), self)
return Group(obj_name, dataobjs, self)[additional_obj]

Which unfortunately means that when faced with an actual Datatype enum, it reports it as a group. That is definitely a bug.

bnlawrence commented 4 months ago

At this point the logic is that everything that is not a dataset is a group, but we know that's not true, this is a portion of the h5dump of the offending content:

DATATYPE "enum_t" H5T_ENUM {
      H5T_STD_U8LE;
      "stratus"          1;
      "cumulus"          2;
      "nimbus"           3;
      "missing"          255;
   };
DATASET "enum_var" {
      DATATYPE  H5T_ENUM {
         H5T_STD_U8LE;
         "stratus"          1;
         "cumulus"          2;
         "nimbus"           3;
         "missing"          255;
      } ...
";

So we need to do something a wee bit different right at this point, so we can warn the user that we are ignoring the not implemented type rather than pretend it is a group, and then if they try and access enum_var, they get a NotImplementedError.

bnlawrence commented 4 months ago

Ok, so now - https://github.com/bnlawrence/pyfive/commit/99945989d215cd9fbd893d62dd2f0d06939631ab - at least pyfive can read a file with this in it, but raises a warning when it finds the datatype message, and raises a notimplementederror when one tries to read that particular data variable. That seems to be the right sort of behaviour until such time as we implement enum support (https://github.com/bnlawrence/pyfive/issues/9, which we may not).

bnlawrence commented 4 months ago

Test support for "sensible things" is here: c12b5b3