aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0
306 stars 81 forks source link

Request: Reference types #98

Open gauteh opened 4 years ago

gauteh commented 4 years ago

Hi,

while figuring out how to get the dimensions of a dataset I am trying to access STD_REF_OBJ's in attributes (using #65 with changes from @balintbalazs). A simple example of references are generated with this script:

from h5py import File
import numpy as np

# http://docs.h5py.org/en/stable/high/dims.html

f = File('dims_1d.h5', 'w')

f['x1'] = [1, 2]
f['x1'].make_scale('x1 name')
f['data'] = np.ones((2,), 'f')
f['data'].dims[0].attach_scale(f['x1'])

which gives the following structure (using h5dump -A dims_1d.h5):

HDF5 "dims_1d.h5" {
GROUP "/" {
   DATASET "data" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): (DATASET 800 /x1 )
         }
      }
   }
   DATASET "x1" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 8;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "x1 name"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 1400 /data ,
               0
            }
         }
      }
   }
}
}

examples of how to de-reference the reference are here: https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5-examples/browse/1_10/C/H5T/h5ex_t_objrefatt.c with hdf5-reference: https://portal.hdfgroup.org/display/HDF5/H5R_DEREFERENCE. As far as I can see this requires some changes in hdf5-rust in order to access since I do not have direct access to the hid_t of my attribute (in this case). I could try to implement this, but it seems a bit tricky, either way it probably requires some dereferencing to get the actual object. In my case getting the name of the object would be enough at first.

Regards, Gaute

gauteh commented 4 years ago

Seems I can use .id() on objects, never noticed before!

JamesMc86 commented 4 months ago

I'm trying to understand if the description above offers a workaround. I have references in a dataset I need to access, but I think this ID getting will work on an attribute. Is that right? Is there any way to fake a reference data set right now?

If not I'm up for taking a swing at integrating this.

mulimoen commented 4 months ago

I've taken a first shot at implementing this in a branch here [0]. Some notes follows

Feel free to play around using this @JamesMc86

[0] https://github.com/mulimoen/hdf5-rust/tree/feature/refs2

JamesMc86 commented 4 months ago

This looks great and will meet all my requirements in its current format. Just sent you a PR to that branch with some integration tests I used during testing.

There are a couple of notes but none of this block what I need right now:

I'm happy to work on some aspects that I can

For my use case these are all just refinements though and this looks great to me.

JamesMc86 commented 4 months ago

Ah, I just tried the clone implementation, but it actually requires a try_clone method, as it can fail.

JamesMc86 commented 4 months ago

So I've been getting my head around more of this at the HDF layer. Is it right to say the ObjectReference is the pre 1.12 type and then since 1.12 we have the standard reference type which is an opaque type over object, region or attribute references.

Therefore, for backward compatibility, we could have an ObjectReference as we do now, which could then wrap the standard reference for these in the newer versions.

mulimoen commented 4 months ago

I think the different reference types are the old type as you suggest with separate Object and Data region references, which is replaced by a more general reference type H5R_ref_t (which we call StdReference but that is a very poor name). I think we need to support both types since the original issue has the old reference type (H5T_STD_REF_OBJECT).

Ah, I just tried the clone implementation, but it actually requires a try_clone method, as it can fail. You can always panic in the code if the clone fails. We will have to figure out what to do with VarLenArray which contains references, but this can be delayed until a later time.

JamesMc86 commented 4 months ago

That sounds great. Yesterday, I did a bit more on the refactoring at https://github.com/WiresmithTech/hdf5-rust/tree/feature/refs2. I hope you are happy with me plodding away on it. I've separated the two types into their own modules so we can easily gate the new types on the HDF version number.

With the references I'm thinking:

  1. Even with the standard reference type, wrapping it as an Object reference will allow for a more ergonomic API when region references and attribute references are introduced.
  2. Keep the old reference type as you said.
  3. maybe (this is where I would like your advice) - alias them together so new code automatically uses the new reference as it seems to be compatible on disk and purely an API change. If we have made it so it automatically migrates then we don't need to expose both types.

So you end up with something like:

< v1.12

struct ObjectReference {...}

v1.12

struct StdReference{...}
struct ObjectReference(StdReference)

What do you feel about this? The alternatives I've thought about:

mulimoen commented 4 months ago

Great that you are working on this. There is a big feature space to explore. Some of the incompatibility can be resolved by using traits for dereference etc.

The reference types are not compatible on disk unfortunately, and we should keep both and make them both available for backwards compatibility with e.g. NetCDF files. Even though 1.10 (and 1.12) has no new planned releases, the old API is very unlikely to be removed.

JamesMc86 commented 4 months ago

Perfect - I'll assume we need both then and work accordingly.