NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

handle compound dtypes #19

Closed: magland closed this issue 5 months ago

magland commented 5 months ago

@rly this is a proposal.

The goal here is to support datasets with compound dtypes in both LindiH5Store and LindiClient so that the round trip leaves compound dtypes intact. In particular, you can slice the dataset through LindiClient in the same way you would slice the original dataset with h5py.

(btw, I didn't look too carefully at how this is handled in hdmf-zarr, but I think the below proposal is consistent with how things are represented in numpy and hdf5)

To illustrate, here is a test that was added:

import tempfile

import h5py
import numpy as np

# Assumed imports: LindiH5Store, LindiClient, and LindiDataset come from the lindi
# package, and _check_equal is a small array-comparison helper defined elsewhere in
# the test module.
from lindi import LindiH5Store, LindiClient, LindiDataset


def test_compound_dtype():
    print("Testing compound dtype")
    with tempfile.TemporaryDirectory() as tmpdir:
        filename = f"{tmpdir}/test.h5"
        with h5py.File(filename, "w") as f:
            dt = np.dtype([("x", "i4"), ("y", "f8")])
            f.create_dataset("X", data=[(1, 3.14), (2, 6.28)], dtype=dt)
        h5f = h5py.File(filename, "r")
        store = LindiH5Store.from_file(filename, url=filename)
        rfs = store.to_reference_file_system()
        client = LindiClient.from_reference_file_system(rfs)
        X1 = h5f["X"]
        assert isinstance(X1, h5py.Dataset)
        X2 = client["X"]
        assert isinstance(X2, LindiDataset)
        assert X1.shape == X2.shape
        assert X1.dtype == X2.dtype
        assert X1.size == X2.size
        # assert X1.nbytes == X2.nbytes  # nbytes are not going to match because the internal representation is different
        assert len(X1) == len(X2)
        if not _check_equal(X1['x'][:], X2['x'][:]):
            print("WARNING. Arrays for x are not equal")
            print(X1['x'][:])
            print(X2['x'][:])
            raise ValueError("Arrays are not equal")
        if not _check_equal(X1['y'][:], X2['y'][:]):
            print("WARNING. Arrays for y are not equal")
            print(X1['y'][:])
            print(X2['y'][:])
            raise ValueError("Arrays are not equal")
        store.close()

To summarize this test: a dataset X with compound dtype [("x", "i4"), ("y", "f8")] is written to an HDF5 file with h5py, the file is converted to a reference file system via LindiH5Store, the result is opened with LindiClient, and the shape, dtype, size, length, and the values of the 'x' and 'y' fields are checked against the original h5py dataset.

Internally this is represented using the _COMPOUND_DTYPE attribute on the zarr array. In the above case, this would be

_COMPOUND_DTYPE: [['x', 'int32'], ['y', 'float64']]
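For reference, that attribute value can be derived from the numpy compound dtype along these lines (a minimal sketch, not the actual LindiH5Store code; it assumes every field is a simple scalar type):

import numpy as np

# The compound dtype from the test above
dt = np.dtype([("x", "i4"), ("y", "f8")])

# Build the _COMPOUND_DTYPE attribute value: a list of [name, dtype-string] pairs
compound_dtype_attr = [[name, str(dt[name])] for name in dt.names]
print(compound_dtype_attr)  # [['x', 'int32'], ['y', 'float64']]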

The data for the array is a JSON-encoded array of the form [[1, 3.14], [2, 6.28]].
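To illustrate how a reader could turn that JSON-encoded data plus the _COMPOUND_DTYPE attribute back into a numpy structured array, here is a minimal sketch (not the actual LindiClient implementation):

import json

import numpy as np

# Values as they would appear in the store for the example dataset
compound_dtype = [["x", "int32"], ["y", "float64"]]
encoded_data = "[[1, 3.14], [2, 6.28]]"

# Rebuild the structured dtype, then convert each JSON row to a tuple
dt = np.dtype([(name, np.dtype(t)) for name, t in compound_dtype])
X = np.array([tuple(row) for row in json.loads(encoded_data)], dtype=dt)

assert X.dtype == np.dtype([("x", "i4"), ("y", "f8")])
assert X["x"].tolist() == [1, 2]
assert X["y"].tolist() == [3.14, 6.28]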

See comments in the source code for more details.

A side note: ideally we wouldn't need to use JSON encoding, but rather could point to the chunks in the original hdf5 file. But that would be a lot trickier to get working with zarr, and we're going to want to support special encoding for references anyway. One caveat of the JSON-encoded inline data approach is that it can make the .zarr.json file very large for large compound-dtype datasets.
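To make that caveat concrete, here is a rough back-of-the-envelope comparison (the dataset size is an arbitrary assumption, just for illustration):

import json

import numpy as np

# A larger compound dataset: one million (int32, float64) rows
n = 1_000_000
values = np.random.default_rng(0).random(n)

raw_size = n * (4 + 8)  # ~12 MB of raw binary data in the original hdf5 chunks
json_size = len(json.dumps([[int(i), float(v)] for i, v in enumerate(values)]))

print(f"raw: {raw_size / 1e6:.1f} MB, inline JSON: {json_size / 1e6:.1f} MB")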

rly commented 5 months ago

Later this week / early next week, I'll do a deep-dive comparison between this approach and the hdmf-zarr approach to see how, if at all, they differ. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

magland commented 5 months ago

Later this week / early next week, I'll do a deep-dive comparison between this approach and the hdmf-zarr approach to see how, if at all, they differ. I suspect that the JSON encoding here might result in discrepancies, but if we have the dtype stored in _COMPOUND_DTYPE in the array attrs, then we might be OK.

Sounds good. Yeah, I'm not sure that JSON encoding is the right thing here. Worth thinking about.