HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Issues with Virtual Datasets (VDS) #89

Open jbhatch opened 4 years ago

jbhatch commented 4 years ago

There are several issues that arise from using h5pyd or the HSDS CLI commands on an HDF5 virtual dataset (VDS) created with h5py. If several HDF5 files are combined into a VDS and h5pyd is used to upload the VDS to HSDS, a tiny (~KB-sized), unusable file is produced on the server. This file shows up in an hsls listing, but it cannot be retrieved back to an NFS share with hsget. If hsload is used to upload the VDS file instead, all of the data backing the VDS is written to HSDS, effectively undoing the virtual aspect of the VDS.
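For context, a VDS like the one described can be built with h5py's virtual dataset API. This is a minimal sketch with hypothetical file and dataset names, not the reporter's actual setup:

```python
import h5py
import numpy as np

N_FILES, N_COLS = 3, 10  # small sizes for illustration

# Create a few source files, each holding a 1-D "data" dataset.
for i in range(N_FILES):
    with h5py.File(f"source_{i}.h5", "w") as f:
        f.create_dataset("data", data=np.full(N_COLS, i, dtype="f8"))

# Stack the sources row-by-row into one virtual dataset.
layout = h5py.VirtualLayout(shape=(N_FILES, N_COLS), dtype="f8")
for i in range(N_FILES):
    layout[i] = h5py.VirtualSource(f"source_{i}.h5", "data", shape=(N_COLS,))

with h5py.File("vds.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=-1)
```

The resulting `vds.h5` contains only the mapping metadata (hence its small size); the actual data stays in the source files, which is what gets lost when a tool uploads the VDS file without resolving the mapping.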

jreadey commented 4 years ago

HSDS doesn't support VDS, and hsload (which reads HDF5 files with h5py) isn't "VDS aware". Still, I would have expected the ingest to work in the sense of setting up HSDS datasets that include all the data from the source files.
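Since hsload isn't VDS-aware, one way to spot affected files before uploading is to scan for virtual datasets with h5py's `Dataset.is_virtual` property. A sketch, with hypothetical file and dataset names:

```python
import h5py

def find_virtual_datasets(path):
    """Return the HDF5 paths of all virtual datasets in a file."""
    found = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.is_virtual:
            found.append(name)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found

# Demo: build a tiny file containing one virtual and one regular dataset.
layout = h5py.VirtualLayout(shape=(5,), dtype="f8")
layout[:] = h5py.VirtualSource("missing.h5", "data", shape=(5,))
with h5py.File("check_me.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=0)
    f.create_dataset("regular", data=[1.0, 2.0])

print(find_virtual_datasets("check_me.h5"))
```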

Can you post some sample VDS files? I could do some experimentation.

BTW, there's another approach to combining multiple files that can be used in HSDS... an HSDS dataset can be created that maps to chunks stored in one or more HDF5 files (as long as they have the same chunk shape, type, and compression options). You can read about how this works here: https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md.

This approach is not as general as VDS, but it works well when the data you are pulling in aligns on chunk boundaries. We used this approach to aggregate data from 7850 HDF5 files into one (7850, 720, 1440) dataset. See: https://github.com/HDFGroup/hdflab_examples/blob/master/NCEP3/ncep3_example.ipynb.
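The chunk-to-file mapping in that aggregation can be sketched in plain Python. This assumes one (1, 720, 1440) chunk per source file, as in the NCEP3 example; the file names here are hypothetical:

```python
# Chunk shape for the aggregate (7850, 720, 1440) dataset: one time slice
# per chunk, so each chunk maps 1:1 to a source file.
CHUNK_SHAPE = (1, 720, 1440)
files = [f"ncep_{i:04d}.h5" for i in range(7850)]  # hypothetical names

def chunk_for_index(t, y, x):
    """Map an element index in the aggregate dataset to its chunk id
    and the source file that holds that chunk."""
    ci = (t // CHUNK_SHAPE[0], y // CHUNK_SHAPE[1], x // CHUNK_SHAPE[2])
    return ci, files[ci[0]]

print(chunk_for_index(42, 100, 500))  # → ((42, 0, 0), 'ncep_0042.h5')
```

Because each chunk lives wholly inside one file, HSDS can serve any slice of the aggregate dataset by reading the relevant chunks directly from the linked HDF5 files; this is what breaks down when the source data does not align on chunk boundaries.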

The hsload util isn't able to link to multiple files, so the chunk map needs to be set up manually. I can walk you through it if you are interested.