HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Try xarray/dask/h5netcdf on top of h5pyd #30

Closed rsignell-usgs closed 6 years ago

rsignell-usgs commented 7 years ago

h5netcdf is a pythonic interface to netcdf4 files using h5py.

It would be super cool to try h5netcdf on top of h5pyd instead.

If that worked we could try xarray with dask on top of h5pyd.

And if that worked, it would be amazing....

jreadey commented 7 years ago

What do you think the pythonic way to switch the dependent module would be? E.g. something like this:

import os

# Select the backend from an environment variable; unset or empty means h5py.
if os.environ.get("USE_H5PYD"):
    import h5pyd
else:
    import h5py

Would that mess up packaging?

We could create an entirely new module (h5netcdfd?), but would then need to sync any changes from h5netcdf regularly.

shoyer commented 7 years ago

The h5py module is actually only explicitly used a handful of times inside h5netcdf: https://github.com/shoyer/h5netcdf/blob/master/h5netcdf/core.py

Most of the time, we use method calls on existing h5py.File objects.

What do you think the pythonic way to switch the dependent module would be?

Depending on how much the interface needs to change to accommodate h5pyd, the cleanest way to do this is probably to add constructor arguments to the handful of h5netcdf functions/classes that open a file.

For example, we might change h5netcdf.File like so:

class File(Group):
    def __init__(self, path, mode='a', backend=h5py, **kwargs):
        self._h5file = backend.File(path, mode, **kwargs)

Then, for a user, using h5pyd behind the scenes is as simple as h5netcdf.File(..., backend=h5pyd). If the modules are not complete drop-in equivalents, then at least we could accept string names like backend='h5pyd'.
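The string-name variant could look roughly like the sketch below. This is only an illustration, not h5netcdf's actual code: resolve_backend is a hypothetical helper, and File here is a simplified stand-in that omits the Group base class. It assumes only that the chosen backend exposes an h5py-style File(path, mode, **kwargs) constructor.

```python
import importlib


def resolve_backend(backend="h5py"):
    """Return an object that provides an h5py-style ``File`` class.

    ``backend`` may be a module (or any object with a ``File`` attribute)
    or an importable module name such as "h5py" or "h5pyd".
    Hypothetical helper, not part of h5netcdf.
    """
    if isinstance(backend, str):
        # Import by name, so only the backend actually requested is needed.
        backend = importlib.import_module(backend)
    if not hasattr(backend, "File"):
        raise TypeError(f"{backend!r} does not provide a File class")
    return backend


class File:
    # Simplified stand-in for h5netcdf.File (the real class subclasses Group).
    def __init__(self, path, mode="a", backend="h5py", **kwargs):
        self._backend = resolve_backend(backend)
        self._h5file = self._backend.File(path, mode, **kwargs)
```

Because the backend is chosen per call rather than per process, one script could open a local file with backend="h5py" and a remote one with backend="h5pyd" side by side.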

I certainly would be very happy to accept patches to add this flexibility in h5netcdf.

Generally, checking environment variables for this sort of thing is discouraged, since it makes it hard to switch between options in user code (there are certainly legitimate cases for using both h5py and h5pyd at the same time).

rsignell-usgs commented 7 years ago

@shoyer, this sounds great. I'd submit a PR right now, except that we should first wait for shared dimensions to be working (https://github.com/HDFGroup/h5pyd/issues/32), right @jreadey?

rsignell-usgs commented 6 years ago

@ajelenak-thg and @jreadey,

Just to record this somewhere, here's what I did to get a custom environment for the ESIP Winter Meeting (Jan 9-11, 2017), with xarray working with h5pyd:

With this h5pyd_env.yml:

name: h5pyd
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6
  - h5py
  - nb_conda_kernels
  - pytz
  - requests
  - matplotlib
  - pip:
      - git+https://github.com/HDFGroup/h5pyd.git@master

I did:

conda env create -f h5pyd_env.yml
source activate h5pyd
conda install xarray
conda remove h5netcdf
pip install --no-deps --upgrade git+https://github.com/ajelenak-thg/h5netcdf.git@h5pyd
conda install --no-deps xarray

rsignell-usgs commented 6 years ago

BTW, thanks to @ocefpaf for helping me figure this out!

rsignell-usgs commented 6 years ago

Xarray is now working nicely with HSDS: https://gist.github.com/rsignell-usgs/cc2d2d4fe1930bd949119e543b56bce1

Closing this issue, although some Dask tasks remain: https://github.com/pangeo-data/pangeo/issues/75#issuecomment-357734564