HDFGroup / h5pyd

h5py distributed - Python client library for the HDF REST API

Round-tripping with hsload and hsget #38

Open · rsignell-usgs opened this issue 7 years ago

rsignell-usgs commented 7 years ago

@jreadey, you used hsload to put our Hurricane Sandy netCDF-4 file on HSDS:

(IOOS) rsignell@0e6be50c3dc2:~$ hsls /home/john/sandy.nc/

john                            domain   2017-09-07 22:11:07 /home/john/sandy.nc
1 items

If I try to use hsget to get that dataset back, I get errors:

(IOOS) rsignell@0e6be50c3dc2:~$ hsget /home/john/sandy.nc sandy.nc
2017-10-14 14:00:39,424 ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options
ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options
2017-10-14 14:01:50,324 ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options
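
For context, this message appears to originate in h5py, which hsget uses to write the local file: create_dataset rejects chunk and filter options for scalar (zero-dimensional) datasets. A minimal reproduction in plain h5py (illustrative file and dataset names):

import h5py
import numpy as np

with h5py.File("scalar_demo.h5", "w") as f:
    # h5py raises TypeError("Scalar datasets don't support chunk/filter options")
    # whenever a zero-dimensional dataset is created with chunking or compression.
    try:
        f.create_dataset("s", shape=(), dtype="f8", chunks=True)
    except TypeError as e:
        print(e)
    # Without chunk/filter options, scalar datasets are fine:
    f.create_dataset("s", data=np.float64(3.14))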

And although I do end up with a sandy.nc file, ncdump fails on it (see below). I guess that's not too surprising in light of #32, right?

But do you think one day we will be able to round-trip a dataset using hsload and hsget?


(IOOS) rsignell@0e6be50c3dc2:~$ ncdump -h sandy.nc
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 140414440146688:
  #000: H5L.c line 1183 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 844 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #002: H5Gobj.c line 708 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #003: H5Gstab.c line 566 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #004: H5B.c line 1221 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #005: H5B.c line 1177 in H5B_iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #006: H5Gnode.c line 1039 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
ncdump: sandy.nc: NetCDF: HDF error
(IOOS) rsignell@0e6be50c3dc2:~$
jreadey commented 7 years ago

There are some updates in v0.2.7 that enable files with dimension scales to be correctly uploaded to the HSDS service. There is still a problem with downloading the files, which will require some HSDS updates to resolve.
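
For readers unfamiliar with the mechanism: netCDF-4 models each dimension as an HDF5 dimension scale dataset attached to the data variables, and that is the structure hsload and hsget have to preserve. A minimal sketch in plain h5py (illustrative names; assumes a recent h5py with make_scale):

import h5py
import numpy as np

with h5py.File("dims_demo.h5", "w") as f:
    # Each netCDF dimension becomes an HDF5 dimension scale dataset...
    time = f.create_dataset("time", data=np.arange(10.0))
    time.make_scale("time")

    # ...and each data variable attaches the scale to the matching axis.
    temp = f.create_dataset("temp", data=np.arange(10.0) * 0.5)
    temp.dims[0].attach_scale(time)

    # Tools like h5netcdf recover the dimension name from the scale's path.
    print(temp.dims[0][0].name)   # prints: /time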

Also, I noticed that there are some attributes in the sandy.nc file that can't be read with h5py. These appear to be related to this issue: https://github.com/h5py/h5py/issues/719.

rsignell-usgs commented 7 years ago

Looks like this was fixed in netcdf-c on September 1: https://github.com/Unidata/netcdf-c/commit/4dd8e380c183a016a5edec5f5fd945b1e0954a5f

and released in version 4.5.0 on October 20: https://github.com/Unidata/netcdf-c/releases/tag/v4.5.0

I will try converting those files to netcdf4 again and see if that fixes the problem.

jreadey commented 7 years ago

OK, thanks. For cases where a netCDF file with the bug is used, I've added a check so that hsload just prints a warning message and continues on with the other attributes.
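
Not the actual hsload code, but the warn-and-continue pattern described is roughly this (illustrative helper; assumes h5py-style objects on both sides):

# Illustrative sketch only -- not the real hsload implementation.
# Copy attributes one by one, warning on any the library cannot read
# (e.g. the h5py#719 case) instead of aborting the whole upload.
def copy_attrs(src_obj, dest_obj):
    for name in src_obj.attrs:
        try:
            dest_obj.attrs[name] = src_obj.attrs[name]
        except (OSError, TypeError) as e:
            print("WARNING: skipping attribute %s: %s" % (name, e))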

rsignell-usgs commented 7 years ago

I used nccopy from NetCDF 4.5.0 to recreate my Sandy netCDF-4 files from the original netCDF-3 files (-7 writes the netCDF-4 classic-model format; -d 7 sets deflate compression level 7):

nccopy -7 -d 7 Sandy_ocean_his.nc Sandy_ocean_his_nc4c.nc

and then used hsload to write to HSDS. The only error I got was:

$ hsload Sandy_ocean_his_nc4c.nc /home/rsignell/sandy2.nc
2017-12-03 19:42:48,871 utillib.py:266 ERROR: failed to create attribute script_file of object / -- unknown object type
ERROR: failed to create attribute script_file of object / -- unknown object type
rsignell-usgs commented 7 years ago

When I try to load the HSDS dataset using xarray with the h5netcdf engine:

import xarray as xr
ds = xr.open_dataset('Sandy_ocean_his.nc')
ds = xr.open_dataset('/home/rsignell/sandy2.nc', engine='h5netcdf')

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-8b828d9bcc43> in <module>()
----> 1 ds = xr.open_dataset('/home/rsignell/sandy2.nc', engine='h5netcdf')

~/.conda/envs/hsds/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables)
    292         elif engine == 'h5netcdf':
    293             store = backends.H5NetCDFStore(filename_or_obj, group=group,
--> 294                                            autoclose=autoclose)
    295         elif engine == 'pynio':
    296             store = backends.NioDataStore(filename_or_obj,

~/.conda/envs/hsds/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in __init__(self, filename, mode, format, group, writer, autoclose)
     62         opener = functools.partial(_open_h5netcdf_group, filename, mode=mode,
     63                                    group=group)
---> 64         self.ds = opener()
     65         if autoclose:
     66             raise NotImplementedError('autoclose=True is not implemented '

~/.conda/envs/hsds/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in _open_h5netcdf_group(filename, mode, group)
     48 def _open_h5netcdf_group(filename, mode, group):
     49     import h5netcdf.legacyapi
---> 50     ds = h5netcdf.legacyapi.Dataset(filename, mode=mode)
     51     with close_on_error(ds):
     52         return _nc4_group(ds, group, mode)

/notebooks/rsignell/github/h5netcdf/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, **kwargs)
    584         # if we actually use invalid NetCDF features.
    585         self._write_ncproperties = (invalid_netcdf is not True)
--> 586         super(File, self).__init__(self, self._h5path)
    587 
    588     def _check_valid_netcdf_dtype(self, dtype, stacklevel=3):

/notebooks/rsignell/github/h5netcdf/h5netcdf/core.py in __init__(self, parent, name)
    241                     # variables.
    242                     self._current_dim_sizes[k] = \
--> 243                         self._determine_current_dimension_size(k, current_size)
    244 
    245                     if dim_id is None:

/notebooks/rsignell/github/h5netcdf/h5netcdf/core.py in _determine_current_dimension_size(self, dim_name, max_size)
    286 
    287             for i, var_d in enumerate(var.dims):
--> 288                 name = _name_from_dimension(var_d)
    289                 if name == dim_name:
    290                     max_size = max(var.shape[i], max_size)

/notebooks/rsignell/github/h5netcdf/h5netcdf/core.py in _name_from_dimension(dim)
     34     # First value in a dimension is the actual dimension scale
     35     # which we'll use to extract the name.
---> 36     return dim[0].name.split('/')[-1]
     37 
     38 

AttributeError: 'NoneType' object has no attribute 'split'

This was after changing import h5py to import h5pyd as h5py in h5netcdf.
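
That substitution amounts to a one-line shim at the top of h5netcdf's modules (the crude experiment described here, not an official integration):

# Point h5netcdf at the REST client instead of the local HDF5 bindings;
# every h5py.* call in the module then goes through h5pyd.
import h5pyd as h5py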

ghost commented 7 years ago

@rsignell-usgs We are aware of this problem with h5netcdf and h5pyd. h5pyd currently cannot return the HDF5 path name for HDF5 objects that are not accessed by following the file's hierarchy, and returning an HDF5 dimension scale dataset as an h5py.Dataset is one such access pattern.
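
Concretely, that is the step where the traceback above dies: h5netcdf's _name_from_dimension expects dim[0].name to be the scale's hierarchy path (e.g. /time), but h5pyd returned None for it. Reduced to its essence (the None is as reported above):

# With h5py, dim[0].name would be something like '/time'.
# With h5pyd at the time of this thread, it came back as None:
name = None
try:
    name.split('/')[-1]       # the expression at h5netcdf/core.py line 36 in the traceback
except AttributeError as e:
    print(e)                  # 'NoneType' object has no attribute 'split'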

Are you working on enabling h5netcdf to work with h5pyd? I'm asking because I just started working on this in the last couple of days. No need for us to duplicate the effort.

rsignell-usgs commented 7 years ago

@ajelenak-thg, no, I'm not working on it. I just forked h5netcdf, replaced import h5py with import h5pyd as h5py, and observed that it didn't work.

ghost commented 7 years ago

@rsignell-usgs That's how far I was able to progress, too. 😃 I think @jreadey is working on a fix.