fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Nested HDF5 Data / HEC-RAS #490

Open thwllms opened 1 month ago

thwllms commented 1 month ago

I'm working on development of the rashdf library for reading HEC-RAS HDF5 data. A big part of the motivation for development of the library is stochastic hydrologic/hydraulic modeling.

We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that, e.g., results for many different stochastic flood simulations from a given RAS model can be opened as a single xarray Dataset. For example, results for 100 different simulations could be concatenated along a new simulation dimension, with the coordinates being the index number of each simulation. It took me a little while to figure out how to make that happen because RAS HDF5 data is highly nested and doesn't conform to typical conventions.
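For illustration, the end goal looks roughly like this (a sketch only; the per-simulation reference JSON filenames are made up):

import xarray as xr

# Hypothetical reference JSONs, one per stochastic simulation
ref_files = [f"sim_{i:03d}.json" for i in range(100)]

# Open each reference set lazily via the zarr engine
datasets = [
    xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={"consolidated": False, "storage_options": {"fo": ref}},
    )
    for ref in ref_files
]

# Concatenate along a new "simulation" dimension, with the simulation index as the coordinate
ds = xr.concat(datasets, dim="simulation")
ds = ds.assign_coords(simulation=list(range(len(ref_files))))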

The way I implemented it is hacky (a rough code sketch follows the list):

  1. Start with an xr.Dataset pulled from the HDF file and the path of each child xr.DataArray within the HDF file.
  2. Get the filters for each DataArray: filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)
  3. Get the storage info for each DataArray: storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
  4. Build out metadata for chunks using storage_info
  5. "Write" the xr.Dataset to a zarr.MemoryStore with compute=False, to generate the framework of what's needed for the Zarr metadata
  6. Read the objects generated by writing to zarr.MemoryStore and decode
  7. Assemble the zarr.MemoryStore objects, filters, and storage_info into a dictionary and finally return
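In condensed form it looks something like this (a rough sketch; ras_h5, xr_ds, var_paths, and file_url stand in for what rashdf actually passes around, and the kerchunk internals are called exactly as in steps 2-3):

import xarray as xr
import zarr
from kerchunk.hdf import SingleHdf5ToZarr

def build_zarr_metadata(ras_h5, xr_ds, var_paths, file_url):
    """var_paths maps each DataArray name to its dataset path inside the HDF5 file."""
    refs = {}
    filters = {}

    # Steps 2-4: filters and chunk locations for each underlying HDF5 dataset
    for var, path in var_paths.items():
        hdf_ds = ras_h5[path]
        filters[var] = SingleHdf5ToZarr._decode_filters(None, hdf_ds)
        storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
        for chunk_index, info in storage_info.items():
            chunk_key = f"{var}/" + ".".join(str(i) for i in chunk_index)
            refs[chunk_key] = [file_url, info["offset"], info["size"]]

    # Steps 5-6: write only the metadata skeleton to an in-memory Zarr store
    store = zarr.MemoryStore()
    xr_ds.to_zarr(store, compute=False)

    # Step 7: merge the skeleton (.zgroup/.zarray/.zattrs) with the chunk references;
    # in the real implementation the filters are also folded into each variable's .zarray
    for key, value in store.items():
        if key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs"):
            refs[key] = value.decode()
    return refs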

I suppose my questions are:

  1. Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
  2. Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

martindurant commented 1 month ago

Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.

The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.
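When you do get to that point, kerchunk.combine.MultiZarrToZarr is the piece intended for it; something along these lines (an untested sketch, filenames made up) should give a single reference set with a simulation dimension keyed by each dataset's position in the list:

from kerchunk.combine import MultiZarrToZarr

# One reference set per stochastic simulation
refs = [f"sim_{i:03d}.json" for i in range(100)]

mzz = MultiZarrToZarr(
    refs,
    concat_dims=["simulation"],
    coo_map={"simulation": "INDEX"},  # use position in the list as the coordinate
    remote_protocol="file",
)
combined = mzz.translate()  # one reference set spanning all simulations

The combined dict then opens with xarray the same way as a single reference set.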

> Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.

thwllms commented 1 month ago

> Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.

SingleHdf5ToZarr.translate works, but due to the way the HDF is structured, the output is difficult to manage. (Or at least I haven't totally figured it out...) HEC-RAS HDF5 Datasets within the same Group have mismatched and/or unlabeled dimensions, so we end up with errors and phony_dims. For example, with the JSON reference generated below:

>>> import h5py
>>> import xarray as xr
>>> from kerchunk.hdf import SingleHdf5ToZarr
>>> ras_h5 = h5py.File("/mnt/c/temp/ElkMiddle.p01.hdf", "r")
>>> zmeta = SingleHdf5ToZarr(ras_h5, "file:///mnt/c/temp/ElkMiddle.p01.hdf").translate()
>>> import json
>>> with open("ElkMiddle.p01.hdf.json", "w") as z:
...     z.write(json.dumps(zmeta, indent=2))
...
722087
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}})
>>> ds
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    File Type:     HEC-RAS Results
    File Version:  HEC-RAS 6.3.1 September 2022
    Projection:    PROJCS["USA_Contiguous_Albers_Equal_Area_Conic_USGS_versio...
    Units System:  US Customary
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}}, group="/Results/Unsteady/Output/Output Blocks/Base Output/Unsteady Time Series/2D Flow Areas/ElkMid
dle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1188, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/store.py", line 58, in open_dataset
    ds = Dataset(vars, attrs=attrs)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 713, in __init__
    variables, coord_names, dims, indexes, _ = merge_data_and_coords(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 427, in merge_data_and_coords
    return merge_core(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/merge.py", line 705, in merge_core
    dims = calculate_dimensions(variables)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/variable.py", line 3009, in calculate_dimensions
    raise ValueError(
ValueError: conflicting sizes for dimension 'phony_dim_1': length 33101 on 'Face Velocity' and length 14606 on {'phony_dim_0': 'Cell Cumulative Precipitation Depth', 'phony_dim_1': 'Cell Cumulative Precipitation Depth'}
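(For what it's worth, the failure above happens in xarray's dimension merging rather than in the references themselves; here's a rough sketch of how the same group can be inspected with zarr directly, reusing the JSON from above:)

import fsspec
import zarr

fs = fsspec.filesystem("reference", fo="ElkMiddle.p01.hdf.json")
group_path = (
    "Results/Unsteady/Output/Output Blocks/Base Output/"
    "Unsteady Time Series/2D Flow Areas/ElkMiddle"
)
grp = zarr.open_group(fs.get_mapper(""), mode="r")[group_path]
for name, arr in grp.arrays():
    print(name, arr.shape, arr.dtype)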

The structure is complex. Any developer who has worked with RAS data could rant about it, but ultimately the point is that a helping hand is needed to extract data from RAS HDF5 files into nice xarray objects, hence the rashdf project.

We considered using xarray-datatree for the project but ultimately decided not to use it at this time since a) DataTree seems to be making its way into xarray proper, b) I couldn't figure out how to get Kerchunk and datatree working together, and c) our driving use case is pretty focused on specific data within the files and it seemed simpler to start by tackling those specific needs. Here's a visualized datatree of a typical RAS HDF file.

> The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.

Combining reference sets is not in the rashdf library itself, and I'm not sure we plan to add it there, but a logical aggregate is definitely the idea.

> Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

> This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.

Seems like a reasonable assumption. Calling methods with leading underscores feels naughty to me, but if those methods are unlikely to change then maybe we're alright.

thwllms commented 1 month ago

Sorry, didn't mean to close this.

martindurant commented 1 month ago

That truly is a gnarly data hierarchy. I wonder whether kerchunk gets confused by groups which have both child group(s) and array(s). The zarr model does support that, but I wouldn't be too surprised if we have extra hidden assumptions.
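(A toy check of the zarr side of that, nothing RAS-specific: a group can hold arrays and subgroups side by side.)

import numpy as np
import zarr

root = zarr.group()  # in-memory group
geom = root.create_group("Geometry")
geom.create_dataset("Coordinates", data=np.zeros((10, 2)))  # array child
geom.create_group("2D Flow Areas")                          # group child alongside it
print(list(geom.array_keys()), list(geom.group_keys()))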

Writing specialised versions of the file scanner for a well-understood use case is of course fine, and I can try to help however I can.

> Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?

It might be worthwhile figuring out what exactly is going wrong with pure kerchunk, but this doesn't touch most current users, as there is a lot of netCDF-compliant data, which is far simpler in structure. If you make interesting reference sets with your code, I'd be happy to see them and even better would be a blog post about your process :)