NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Document special attributes and mapping #13

Open rly opened 5 months ago

rly commented 5 months ago
- `_REFERENCE`
- `_EXTERNAL_ARRAY_LINK`
- `_SCALAR`

and the mappings applied during translation
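As a rough illustration of how a reader might interpret one of these (a sketch only; the dataset name and the length-1 storage assumption are mine, and the authoritative definitions belong in the docs):

```python
import zarr

# Illustrative sketch only; field meanings should come from the project docs.
# The store path and dataset name here are made up.
g = zarr.open_group("example.lindi.zarr", mode="r")
arr = g["session_description"]

# _SCALAR=True marks a dataset that stands in for an HDF5 scalar
# (assumed here to be stored as a length-1 array).
if arr.attrs.get("_SCALAR", False):
    value = arr[0]
else:
    value = arr[:]
```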

bendichter commented 5 months ago

In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?

https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L543-L549
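For context, that convention can be paraphrased like this (illustrative only; see the linked lines for the actual implementation and naming):

```python
# netCDF4/kerchunk-style convention: every array gets an _ARRAY_DIMENSIONS
# attribute with one name per axis, so a scalar dataset is marked by an
# empty list.
def array_dimensions(shape):
    return [f"phony_dim_{i}" for i in range(len(shape))]

array_dimensions((100, 3))  # ['phony_dim_0', 'phony_dim_1']
array_dimensions(())        # [] -> scalar dataset
```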

magland commented 5 months ago

> In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?
>
> https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L543-L549

This is something we'll need to consider. I'm hesitant to have `_ARRAY_DIMENSIONS` with `phony_dim_x` on every dataset... the `_SCALAR=True` method seems more straightforward. But I do understand that there are benefits to interoperability. It would be helpful to think of a scenario where we'd need that in order to use some tool. I'm hesitant to go down the path where we end up with many attributes supporting all the various projects (kerchunk, lindi, hdmf-zarr)... instead of putting logic in the various tools so they can handle the different cases.

rly commented 5 months ago

> In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?
>
> fsspec/kerchunk@6fe1f0a/kerchunk/hdf.py#L543-L549

Because information about this is scattered throughout issues and docs, I wanted to summarize:

Xarray is a popular Python library for working with labelled multi-dimensional arrays. It reads and writes netCDF4 files by default (these are specially organized HDF5 files). Xarray requires dimension names to work, and it can read/write them from/to netCDF4 and HDF5 files (it uses HDF5 dimension scales). Scalar datasets in Xarray and netCDF4 are indicated by the lack of dimension names. All netCDF4 datasets have dimension names for non-scalar data and lack dimension names for scalar data, so Xarray and netCDF4 are compatible. But not all HDF5 datasets have dimension names. When Xarray loads an HDF5 dataset without dimension names, it generates phony dimension names for them in memory and on write.

Xarray can also read and write Zarr files, but Zarr does not support storing dimension names, so to write Xarray-compatible Zarr files, Xarray defined a special Zarr array attribute, `_ARRAY_DIMENSIONS`, to store the dimension names. On read of Zarr files, it looks for that attribute. See https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html.
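A minimal sketch of that convention (file and variable names made up; writing through Xarray adds the attribute itself, while writing through Zarr directly you would set it yourself):

```python
import numpy as np
import xarray as xr
import zarr

# Write a small dataset with named dimensions via xarray's Zarr backend.
ds = xr.Dataset({"voltage": (("time", "channel"), np.zeros((100, 4)))})
ds.to_zarr("example_xr.zarr", mode="w")

# Reading the store back with plain zarr shows the special attribute.
z = zarr.open_group("example_xr.zarr", mode="r")
print(z["voltage"].attrs["_ARRAY_DIMENSIONS"])  # ['time', 'channel']
```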

Kerchunk, in order to generate Xarray-compatible Zarr files, uses the same convention: it creates the attribute `_ARRAY_DIMENSIONS`. If the dataset is scalar, the list is empty; otherwise, it sets the list to phony dimension names.

So adding the _ARRAY_DIMENSIONS attribute (using phony dim names when no dimension scales are present) allows the Zarr data to be read by Xarray and any other tools that adopt the same convention when reading Zarr data. I see the value of following the same convention, but I am also hesitant to adopt it until we have a need for it.

bendichter commented 5 months ago

Thanks for the summary, @rly! I was not aware of a lot of that.

I do think that Xarray support would be quite valuable, but this may not be the best way to do it. Many of these dataset dimensions really should have names as indicated by the NWB schema.

rly commented 5 months ago

Just to add: the NetCDF group has their own Zarr implementation called NCZarr, with its own conventions. I think Xarray supports reading both its own `.zattrs["_ARRAY_DIMENSIONS"]` convention and the NCZarr `.zarray["_NCZARR_ARRAY"]["dimrefs"]` convention, and NCZarr can also write Zarr files following the `.zattrs["_ARRAY_DIMENSIONS"]` convention if `mode=xarray`. See https://github.com/pydata/xarray/issues/6374.

IMO, this demonstrates the complexity of having too many different conventions and the danger of adding another. https://xkcd.com/927/

For simplicity, I'm still inclined to follow neither convention until Xarray (or netCDF) is within scope, but perhaps that is naive.

> When Xarray loads an HDF5 dataset without dimension names, it generates phony dimension names for them in memory and on write.

Just to add: technically, Xarray doesn't do this itself; both the default I/O engine, netcdf4, and the alternate I/O engine for HDF5 files, h5netcdf, do it.

I don't know why Xarray doesn't generate phony dimension names when reading Zarr arrays without dimension names. That would make things easier...
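For reference, this is roughly how phony dimension names are requested when reading plain HDF5 (no dimension scales) through the h5netcdf engine; a sketch, where `phony_dims` is an h5netcdf backend option (exact spelling may vary by version) and the file name is made up:

```python
import xarray as xr

# The phony names come from the backend, not from xarray itself.
ds = xr.open_dataset("plain_file.h5", engine="h5netcdf", phony_dims="sort")
print(ds.dims)  # e.g. {'phony_dim_0': 100, 'phony_dim_1': 4}
```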

magland commented 5 months ago

Just to add onto this... custom Zarr stores are easy to make, so one can create adapters that attach the various needed attributes for different contexts. For example, you could have a simple adapter that adds `_ARRAY_DIMENSIONS` everywhere it is needed. So you'd have

store = ... some store we are working with
store2 = adapter(store)

with no loss of efficiency.
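A minimal sketch of what such an adapter could look like, assuming a Zarr v2-style store (a mapping from keys like `path/.zattrs` and `path/.zarray` to JSON bytes); this is not lindi's actual API, and it ignores consolidated metadata:

```python
import json
from collections.abc import Mapping

class PhonyDimsAdapter(Mapping):
    """Read-only wrapper that injects phony _ARRAY_DIMENSIONS into each
    array's .zattrs on the fly. Sketch only; assumes an unconsolidated
    Zarr v2 store where every array already has a .zattrs key."""

    def __init__(self, base_store):
        self._base = base_store

    def __getitem__(self, key):
        value = self._base[key]
        if key.endswith(".zattrs"):
            attrs = json.loads(value)
            zarray_key = key[: -len(".zattrs")] + ".zarray"
            # Only arrays have a sibling .zarray; groups are left untouched.
            if "_ARRAY_DIMENSIONS" not in attrs and zarray_key in self._base:
                shape = json.loads(self._base[zarray_key])["shape"]
                attrs["_ARRAY_DIMENSIONS"] = [
                    f"phony_dim_{i}" for i in range(len(shape))
                ]
                value = json.dumps(attrs).encode("utf-8")
        return value

    def __iter__(self):
        return iter(self._base)

    def __len__(self):
        return len(self._base)
```

The wrapped store only rewrites the small `.zattrs` documents on access, so no chunk data is copied; `PhonyDimsAdapter(store)` could then be handed to a Zarr/Xarray reader in place of the original store.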

magland commented 5 months ago

This has been documented to an extent here https://github.com/NeurodataWithoutBorders/lindi/blob/main/docs/special_zarr_annotations.md

rly commented 5 months ago

It looks like the Allen Institute for Neural Dynamics would like to use xarray with NWB Zarr files: https://github.com/hdmf-dev/hdmf-zarr/issues/176

magland commented 5 months ago

> It looks like the Allen Institute for Neural Dynamics would like to use xarray with NWB Zarr files: hdmf-dev/hdmf-zarr#176

Good to know. As I suggested above, I would propose an adapter that adds the phony_dim `_ARRAY_DIMENSIONS` attributes, rather than having them in the `.zarr.json`.

oruebel commented 5 months ago

> that adds the phony_dim _ARRAY_DIMENSIONS attributes

For the case where array dimensions are unknown, I agree that having a way to emulate them rather than storing invalid information is probably preferable. However, in the case of NWB, we often know the dimension names from the schema, so it would be nice to have those reflected.
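For example (a sketch with a hypothetical helper; the lookup keys are made up, and the dimension names are the ones the NWB schema gives for these datasets):

```python
# Hypothetical lookup from dataset paths to NWB-schema dimension names.
SCHEMA_DIMS = {
    "ElectricalSeries/data": ["num_times", "num_channels"],
    "TimeSeries/timestamps": ["num_times"],
}

def array_dimensions(schema_key, shape):
    """Prefer dimension names from the NWB schema; fall back to phony names."""
    names = SCHEMA_DIMS.get(schema_key)
    if names is not None and len(names) == len(shape):
        return names
    return [f"phony_dim_{i}" for i in range(len(shape))]

array_dimensions("ElectricalSeries/data", (1000, 32))  # ['num_times', 'num_channels']
```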