JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Add mid/high level interface for HDF5 Dimension Scale #1124

Open fergu opened 8 months ago

fergu commented 8 months ago

Opening this as a stub issue for adding a mid/high-level interface to the Dimension Scale functions. I meant to do this a while back when the low-level library calls were added, but I never got around to it. See also #720. I will also add a short blurb on what HDF5 dimension scales are useful for in a reply to this issue, for anyone who is unfamiliar.

Here are some current ideas for things to implement. This is closer to a train of thought than anything set in stone, so I'd welcome feedback on how this could/should be changed:

High-level interface:

Mid-level implementations of low level library calls:

fergu commented 8 months ago

To help the discussion, here is a bit of context on what dimension scales do, for anyone unfamiliar. (This blurb is written by me, not taken from the HDF5 docs or anything, so take it with a grain of salt and all that.)

HDF5 Dimension Scales are basically just a way to attach coordinate information to a given axis of a dataset. If I am given a new HDF5 file that someone else made, and I want to know the coordinate information associated with an axis of a dataset, I can just query the dimension scale associated with that axis, and it will give me a dataset with the corresponding coordinate information. This is much smoother than trying to infer what other dataset in the file is meant to be the coordinate data for that axis based on names or context or an email from the creator of the file. This additionally allows multiple datasets to share a single dimension scale, with all of those datasets pointing to a single piece of data in the file (as opposed to copies of identical data scattered around the file). In other words, these are another tool in helping make HDF5 files "self-describing".

Practically, dimension scales are just regular HDF5 datasets with some extra attributes added to track where they are being used. You can write an HDF5 dataset to a file, and then use the HDF5 library function h5ds_set_scale() to mark that dataset as a scale. This adds a few attributes to the new scale to indicate things like the "name" of the scale (which is different from the path of the scale) and the list of datasets that the scale is attached to. The handy thing about that last point is that attaching a scale to a dataset (using h5ds_attach_scale()) also adds a link to the scale as an attribute of the target dataset. You can read that attribute/link and get back an HDF5.Dataset directly (well, currently an hid_t, but that's what this task aims to fix), without having to find a name, parse a path, or anything like that.
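For anyone who wants to try this with what exists today, here is a minimal sketch using the low-level wrappers in HDF5.API. The file name and the dataset names ("x", "temperature") are made up for illustration. I'm also assuming that the wrappers accept HDF5.Dataset objects where the C library expects an hid_t, and that the attached scale shows up on the target dataset as an attribute named DIMENSION_LIST, which is my reading of the HDF5 dimension scale convention. Treat it as a rough illustration of the current workflow, not as the proposed interface.

```julia
using HDF5

# Hypothetical file name, chosen only for illustration.
filename = tempname() * ".h5"

is_scale, is_attached, has_dimlist = h5open(filename, "w") do f
    # Two ordinary 1-D datasets: the coordinates and the data defined on them.
    f["x"] = collect(0.0:0.1:0.9)
    f["temperature"] = rand(10)
    x = f["x"]
    temperature = f["temperature"]

    # Mark "x" as a dimension scale. The second argument is the scale's
    # "name", which is independent of its path in the file.
    HDF5.API.h5ds_set_scale(x, "x")

    # Attach the scale to the first (and only) dimension of "temperature".
    # The index is passed straight to the C library, so it is 0-based.
    HDF5.API.h5ds_attach_scale(temperature, x, 0)

    # Query the result: is "x" a scale, is it attached to dimension 0,
    # and did the target dataset gain the attribute that links to it?
    result = (
        HDF5.API.h5ds_is_scale(x),
        HDF5.API.h5ds_is_attached(temperature, x, 0),
        haskey(attributes(temperature), "DIMENSION_LIST"),
    )

    close(x)
    close(temperature)
    result
end
```

The example sticks to 1-D datasets on purpose: for multidimensional arrays the dimension index refers to the C library's ordering, which I believe is reversed relative to Julia's column-major indices. That is one of the things a mid/high-level interface could hide.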