NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

handle compound dtypes #19

Closed: magland closed this issue 5 months ago

magland commented 5 months ago

@rly this is a proposal.

The goal here is to support datasets with compound dtypes in both LindiH5Store and LindiClient so that the round trip leaves compound dtypes intact. In particular, you can slice the dataset through LindiClient in the same way you would slice the original dataset with h5py.

(btw, I didn't look too carefully at how this is handled in hdmf-zarr, but I think the below proposal is consistent with how things are represented in numpy and hdf5)

To illustrate, here is a test that was added:

import tempfile

import h5py
import numpy as np

# Assumed imports: LindiH5Store, LindiClient, and LindiDataset come from the lindi
# package, and _check_equal is a small array-comparison helper defined elsewhere in
# the test module.
from lindi import LindiH5Store, LindiClient, LindiDataset


def test_compound_dtype():
    print("Testing compound dtype")
    with tempfile.TemporaryDirectory() as tmpdir:
        filename = f"{tmpdir}/test.h5"
        with h5py.File(filename, "w") as f:
            dt = np.dtype([("x", "i4"), ("y", "f8")])
            f.create_dataset("X", data=[(1, 3.14), (2, 6.28)], dtype=dt)
        h5f = h5py.File(filename, "r")
        store = LindiH5Store.from_file(filename, url=filename)
        rfs = store.to_reference_file_system()
        client = LindiClient.from_reference_file_system(rfs)
        X1 = h5f["X"]
        assert isinstance(X1, h5py.Dataset)
        X2 = client["X"]
        assert isinstance(X2, LindiDataset)
        assert X1.shape == X2.shape
        assert X1.dtype == X2.dtype
        assert X1.size == X2.size
        # assert X1.nbytes == X2.nbytes  # nbytes are not going to match because the internal representation is different
        assert len(X1) == len(X2)
        if not _check_equal(X1['x'][:], X2['x'][:]):
            print("WARNING. Arrays for x are not equal")
            print(X1['x'][:])
            print(X2['x'][:])
            raise ValueError("Arrays are not equal")
        if not _check_equal(X1['y'][:], X2['y'][:]):
            print("WARNING. Arrays for y are not equal")
            print(X1['y'][:])
            print(X2['y'][:])
            raise ValueError("Arrays are not equal")
        store.close()

To summarize this test: a dataset X with compound dtype [("x", "i4"), ("y", "f8")] is written to an HDF5 file with h5py, the file is converted to a reference file system via LindiH5Store, the result is opened with LindiClient, and the shape, dtype, size, length, and the values of the 'x' and 'y' fields are checked against the original h5py dataset.

Internally this is represented using the _COMPOUND_DTYPE attribute on the zarr array. In the above case, this would be

_COMPOUND_DTYPE: [['x', 'int32'], ['y', 'float64']]
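For reference, that attribute value can be derived from the numpy compound dtype along these lines (a minimal sketch, not the actual LindiH5Store code; it assumes every field is a simple scalar type):

import numpy as np

# The compound dtype from the test above
dt = np.dtype([("x", "i4"), ("y", "f8")])

# Build the _COMPOUND_DTYPE attribute value: a list of [name, dtype-string] pairs
compound_dtype_attr = [[name, str(dt[name])] for name in dt.names]
print(compound_dtype_attr)  # [['x', 'int32'], ['y', 'float64']]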

The data for the array is a JSON-encoded array of the form [[1, 3.14], [2, 6.28]].
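To illustrate how a reader could turn that JSON-encoded data plus the _COMPOUND_DTYPE attribute back into a numpy structured array, here is a minimal sketch (not the actual LindiClient implementation):

import json

import numpy as np

# Values as they would appear in the store for the example dataset
compound_dtype = [["x", "int32"], ["y", "float64"]]
encoded_data = "[[1, 3.14], [2, 6.28]]"

# Rebuild the structured dtype, then convert each JSON row to a tuple
dt = np.dtype([(name, np.dtype(t)) for name, t in compound_dtype])
X = np.array([tuple(row) for row in json.loads(encoded_data)], dtype=dt)

assert X.dtype == np.dtype([("x", "i4"), ("y", "f8")])
assert X["x"].tolist() == [1, 2]
assert X["y"].tolist() == [3.14, 6.28]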

See comments in the source code for more details.

A side note: ideally we wouldn't need to use JSON encoding, but rather could point to the chunks in the original hdf5 file. But that would be a lot trickier to get working with zarr, and we're going to want to support special encoding for references anyway. One caveat of the JSON-encoded inline data approach is that it can make the .zarr.json file very large for large compound-dtype datasets.
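To make that caveat concrete, here is a rough back-of-the-envelope comparison (the dataset size is an arbitrary assumption, just for illustration):

import json

import numpy as np

# A larger compound dataset: one million (int32, float64) rows
n = 1_000_000
values = np.random.default_rng(0).random(n)

raw_size = n * (4 + 8)  # ~12 MB of raw binary data in the original hdf5 chunks
json_size = len(json.dumps([[int(i), float(v)] for i, v in enumerate(values)]))

print(f"raw: {raw_size / 1e6:.1f} MB, inline JSON: {json_size / 1e6:.1f} MB")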

rly commented 5 months ago

Later this week / early next week, I'll do a deep-dive comparison between this approach and the hdmf-zarr approach to see how, if at all, they differ. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

magland commented 5 months ago

Later this week / early next week, I'll do a deep-dive comparison between this approach and the hdmf-zarr approach to see how, if at all, they differ. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

Sounds good. Yeah, I'm not sure that JSON encoding is the right thing here. Worth thinking about.