thwllms opened this issue 1 month ago
Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.
The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.
> Could Kerchunk's `SingleHdf5ToZarr._decode_filters` and `_storage_info` methods be made public?
This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.
> Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.
`SingleHdf5ToZarr.translate` works, but due to the way the HDF is structured it's difficult to manage. (Or at least I haven't totally figured it out...) HEC-RAS HDF5 Datasets within the same Group have mismatched and/or unlabeled dimensions. We end up with errors and `phony_dim`s. For example, with this JSON reference:
```python
>>> import h5py
>>> import xarray as xr
>>> from kerchunk.hdf import SingleHdf5ToZarr
>>> ras_h5 = h5py.File("/mnt/c/temp/ElkMiddle.p01.hdf", "r")
>>> zmeta = SingleHdf5ToZarr(ras_h5, "file:///mnt/c/temp/ElkMiddle.p01.hdf").translate()
>>> import json
>>> with open("ElkMiddle.p01.hdf.json", "w") as z:
...     z.write(json.dumps(zmeta, indent=2))
...
722087
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}})
>>> ds
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    File Type:     HEC-RAS Results
    File Version:  HEC-RAS 6.3.1 September 2022
    Projection:    PROJCS["USA_Contiguous_Albers_Equal_Area_Conic_USGS_versio...
    Units System:  US Customary
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}}, group="/Results/Unsteady/Output/Output Blocks/Base Output/Unsteady Time Series/2D Flow Areas/ElkMiddle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1188, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/store.py", line 58, in open_dataset
    ds = Dataset(vars, attrs=attrs)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 713, in __init__
    variables, coord_names, dims, indexes, _ = merge_data_and_coords(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 427, in merge_data_and_coords
    return merge_core(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/merge.py", line 705, in merge_core
    dims = calculate_dimensions(variables)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/variable.py", line 3009, in calculate_dimensions
    raise ValueError(
ValueError: conflicting sizes for dimension 'phony_dim_1': length 33101 on 'Face Velocity' and length 14606 on {'phony_dim_0': 'Cell Cumulative Precipitation Depth', 'phony_dim_1': 'Cell Cumulative Precipitation Depth'}
```
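The error boils down to xarray refusing two arrays that share an inferred dimension name but disagree on its length. A stripped-down illustration of that check (mirroring, not reproducing, xarray's internal `calculate_dimensions`; the variable names and conflicting sizes are from the traceback above, the leading size of 48 is made up):

```python
def calculate_dimensions(variables):
    """Toy version of xarray's dimension-size consistency check.

    `variables` maps variable name -> (dims, shape)."""
    sizes = {}
    for name, (dims, shape) in variables.items():
        for dim, size in zip(dims, shape):
            if dim in sizes and sizes[dim] != size:
                raise ValueError(
                    f"conflicting sizes for dimension {dim!r}: "
                    f"length {size} on {name!r}"
                )
            sizes[dim] = size
    return sizes

# Two HEC-RAS-style variables that both received the fallback name
# phony_dim_1 for their second axis, with different lengths:
variables = {
    "Face Velocity": (("phony_dim_0", "phony_dim_1"), (48, 33101)),
    "Cell Cumulative Precipitation Depth": (("phony_dim_0", "phony_dim_1"), (48, 14606)),
}
```

Because the HDF5 datasets carry no dimension labels, kerchunk can only hand xarray positional `phony_dim_*` names, and any two same-rank arrays of different size in one group collide this way.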
The structure is complex. Any developer who has worked with RAS data could rant about it, but ultimately the point is that a helping hand is needed to extract data from RAS HDF5 files into nice xarray objects, hence the rashdf project.
We considered using xarray-datatree for the project but ultimately decided not to use it at this time since a) `DataTree` seems to be making its way into xarray proper, b) I couldn't figure out how to get Kerchunk and datatree working together, and c) our driving use case is pretty focused on specific data within the files and it seemed simpler to start by tackling those specific needs. Here's a visualized datatree of a typical RAS HDF file.
> The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.
Combining reference sets is not in the `rashdf` library itself and I'm not sure we plan to add it there, but a logical aggregate is definitely the idea.
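For what it's worth, a logical aggregate at the reference level amounts to re-keying each single-file reference set under a new leading dimension. A toy sketch in plain Python, assuming same-shaped variables with one chunk each (`combine_refs` is a hypothetical helper; real combining would go through `kerchunk.combine.MultiZarrToZarr`):

```python
import json

def combine_refs(ref_sets, dim="simulation"):
    """Toy combine: stack same-shaped, single-chunk reference sets along a
    new leading dimension by re-keying each file's chunks."""
    combined = {"version": 1, "refs": {".zgroup": json.dumps({"zarr_format": 2})}}
    n = len(ref_sets)
    for i, rs in enumerate(ref_sets):
        for key, val in rs["refs"].items():
            if key.endswith(".zarray"):
                if i == 0:
                    # Grow the array metadata by one leading dimension.
                    meta = json.loads(val)
                    meta["shape"] = [n] + meta["shape"]
                    meta["chunks"] = [1] + meta["chunks"]
                    combined["refs"][key] = json.dumps(meta)
            elif key.endswith(".zgroup") or key.endswith(".zattrs"):
                combined["refs"].setdefault(key, val)
            else:
                # Chunk key like "Depth/0" -> "Depth/<i>.0"
                var, chunk = key.rsplit("/", 1)
                combined["refs"][f"{var}/{i}.{chunk}"] = val
    return combined
```

Each source file's bytes stay where they are; only the chunk keys and array metadata change, which is exactly what makes the aggregate "logical".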
> > Could Kerchunk's `SingleHdf5ToZarr._decode_filters` and `_storage_info` methods be made public?
>
> This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.
Seems like a reasonable assumption. Calling methods with leading underscores feels naughty to me, but if those methods are unlikely to change then maybe we're alright.
Sorry, didn't mean to close this.
That truly is a gnarly data hierarchy. I wonder whether kerchunk gets confused by groups which have both child group(s) and array(s). The zarr model does support that, but I wouldn't be too surprised if we have extra hidden assumptions.
Writing specialised versions of the file scanner for a well-understood use case is of course fine, and I can try to help however I can.
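For reference, a group holding both a child group and an array is perfectly expressible in the flat key space that zarr v2 (and hence a kerchunk reference set) uses. A small sketch, with a hypothetical `children` helper for walking such keys:

```python
# Flat key layout, as in a kerchunk reference set, for a root group that
# holds both a child group ("Geometry") and a sibling array ("Results") --
# legal in the zarr v2 model.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "Geometry/.zgroup": '{"zarr_format": 2}',
    "Geometry/coords/.zarray": '{"shape": [10], "chunks": [10]}',
    "Results/.zarray": '{"shape": [5], "chunks": [5]}',
}

def children(refs, prefix=""):
    """List immediate child nodes (groups or arrays) under a prefix,
    skipping the .zgroup/.zarray metadata keys of the prefix itself."""
    out = set()
    for key in refs:
        rest = key[len(prefix):]
        if key.startswith(prefix) and "/" in rest:
            out.add(rest.split("/", 1)[0])
    return sorted(out)
```

A scanner that assumes a group's entries are either all groups or all arrays would trip over exactly this kind of mixed layout.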
> Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
It might be worthwhile figuring out what exactly is going wrong with pure kerchunk, but this doesn't touch most current users, as there is a lot of netCDF-compliant data, which is far simpler in structure. If you make interesting reference sets with your code, I'd be happy to see them and even better would be a blog post about your process :)
I'm working on development of the `rashdf` library for reading HEC-RAS HDF5 data. A big part of the motivation for the library is stochastic hydrologic/hydraulic modeling.
We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that e.g., results for many different stochastic flood simulations from a given RAS model can be opened as a single xarray Dataset. For example, results for 100 different simulations could be concatenated in a new `simulation` dimension, with coordinates being the index number of each simulation. It took me a little while to figure out how to make that happen because RAS HDF5 data is highly nested and doesn't conform to typical conventions. The way I implemented it is hacky:
1. Build an `xr.Dataset` pulled from the HDF file, keeping track of the path of each child `xr.DataArray` within the HDF file,
2. For each HDF5 dataset, grab `filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)` and `storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)`,
3. Write the `xr.Dataset` to a `zarr.MemoryStore` with `compute=False`, to generate the framework of what's needed for the Zarr metadata,
4. Read back the `zarr.MemoryStore` and decode its contents,
5. Combine the `zarr.MemoryStore` objects, `filters`, and `storage_info` into a dictionary and finally return it.

I suppose my questions are:

1. Could Kerchunk's `SingleHdf5ToZarr._decode_filters` and `_storage_info` methods be made public?
2. Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
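The `simulation`-dimension concatenation described above can be sketched with plain xarray, using toy in-memory datasets in place of real reference-backed ones:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-ins for per-simulation results; in practice each dataset would
# be opened from its own kerchunk reference set.
runs = [
    xr.Dataset({"depth": (("cell",), np.full(4, float(i)))})
    for i in range(3)
]

# Concatenate along a new "simulation" dimension, using the simulation
# index number as its coordinate.
combined = xr.concat(runs, dim=pd.Index(range(3), name="simulation"))
```

Passing a named `pd.Index` as `dim` both creates the new dimension and attaches the index values as its coordinate, so `combined.sel(simulation=42)` would recover a single run.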