dimension labels ('dims' field in spec) are not used by pynwb

tjd2002 commented 5 years ago

The spec allows for the labeling of dimensions of datasets in the spec, but these are not used by pynwb.

For example, the spec for a behavior.Position group includes the dimension labels 'num_times' and 'num_features' for 2-D data: https://github.com/NeurodataWithoutBorders/pynwb/blob/f21aa5d844a737fbb7773773b477d3868495d82d/src/pynwb/data/nwb.behavior.yaml#L33-L36

But these do not appear to be written into the resulting nwb file by pynwb.

One option would be for pynwb to write these to the 'dims' fields of the HDF5 dataset, however this is not currently done.

import pynwb
import numpy as np
from datetime import datetime
from dateutil import tz

nwb_filename = './test.nwb'

nwbf=nwbf = pynwb.NWBFile(session_description='example session',
                         identifier='example1',
                         session_start_time=datetime.now(tz.tzlocal()))
pos = pynwb.behavior.Position(spatial_series=[], 
                              name='Position')
pos.create_spatial_series(name='Position Data', 
                              timestamps = np.arange(5),
                              data=np.ones((5,2)),
                              reference_frame='top-left corner of video frame')
behav_mod = nwbf.create_processing_module(name='Behavior', 
                                          description='Behavioral variables')
behav_mod.add_data_interface(pos)

with pynwb.NWBHDF5IO(nwb_filename, mode='w') as iow:
    iow.write(nwbf)

io = pynwb.NWBHDF5IO(nwb_filename, mode='r')
nwbf_read = io.read()

print('*** SpatialSeries before writing (data is np.array), no dim labels')
print(nwbf_read.modules['Behavior']['Position']['Position Data'])
print()
print('*** SpatialSeries as read from nwb file (data is HDF5 dataset), no dim labels')
print(nwbf_read.modules['Behavior']['Position']['Position Data'])
print()
print('*** HDF5 "dims" fields are empty')
print(nwbf_read.modules['Behavior']['Position']['Position Data'].data.dims)
for dim in nwbf_read.modules['Behavior']['Position']['Position Data'].data.dims: 
    print(dim)
io.close()

results:

*** SpatialSeries before writing (data is np.array), no dim labels

Position Data <class 'pynwb.behavior.SpatialSeries'>
Fields:
  comments: no comments
  conversion: 1.0
  data: <HDF5 dataset "data": shape (5, 2), type "<f8">
  description: no description
  interval: 1
  num_samples: 5
  reference_frame: top-left corner of video frame
  resolution: 0.0
  timestamps: <HDF5 dataset "timestamps": shape (5,), type "<f8">
  timestamps_unit: Seconds
  unit: meters

*** SpatialSeries as read from nwb file (data is HDF5 dataset), no dim labels

Position Data <class 'pynwb.behavior.SpatialSeries'>
Fields:
  comments: no comments
  conversion: 1.0
  data: <HDF5 dataset "data": shape (5, 2), type "<f8">
  description: no description
  interval: 1
  num_samples: 5
  reference_frame: top-left corner of video frame
  resolution: 0.0
  timestamps: <HDF5 dataset "timestamps": shape (5,), type "<f8">
  timestamps_unit: Seconds
  unit: meters

*** HDF5 "dims" fields are empty
<Dimensions of HDF5 object at 4603206384>
<"" dimension 0 of HDF5 dataset at 4603206384>
<"" dimension 1 of HDF5 dataset at 4603206384>

bendichter commented 5 years ago

Thanks for getting this conversation going again. In addition to the changes to pynwb, this will require a change to the specification of how the schema is mapped to the HDF5 files. This will also require some changes to matnwb to preserve pynwb readability of files written using matnwb. For these reasons, I think this issue should be in the nwb-schema repo. Do you mind if I move it @tjd2002?

tjd2002 commented 5 years ago

I think I disagree with moving it. The schema already describes the ‘dims’ field as labels on the dimensions of (nwb) datasets. The question is just how pynwb should expose it.

https://schema-language.readthedocs.io/en/latest/specification_language_description.html#dims

tjd2002 commented 5 years ago

Ah now I see what you mean. matnwb would need to store/read this the same way.

I am not wedded to attaching these to the HDF5 dataset. Maybe the more natural thing would be to just save the list of dim labels the same way we save other attributes.

In any case, I can see this spans both APIs, so I’m fine with moving it.

bendichter commented 5 years ago

Yeah that seems like a good idea and is probably easy to implement, though it will require some changes across the board (matnwb, pynwb, nwb-schema). I think it will be possible to do this without breaking backward compatibility if we are careful.

tjd2002 commented 5 years ago

As this gets implemented, it may be worth thinking ahead to the implementation of 'dimension scales' as proposed at NeurodataWithoutBorders/pynwb#626 as well. Given how nicely dim labels and dim scales map onto capabilities of HDF5 datasets, it would seem natural to use those to store them.

If we do implement this by just adding HDF5-native dimension labels to datasets, I think we would get backwards compatibility for free. Currently pynwb doesn't use the dimension labels of its HDF5 datasets for anything; any older code should just ignore those labels if they appear.

oruebel commented 5 years ago

Attaching things in HDF5 is useful, in particular for MatNWB and folks that look at the raw HDF5.

However, I think for PyNWB it may be useful not to just strictly rely on HDF5 for this. I.e., in addition to writing the information to the HDF5 file, I think it may be useful to put this sort of functionality on the level of the container classes as this will allow you to get information about dimensions names (and scales) independent of whether the file has been read from HDF5 or not. E.g., by relying purely on HDF5 this means that you cannot get this information until after a file has been written and it may also not work as other backends are being integrated with PyNWB. However, since all the information is in the schema, we should be able to make this accessible via the container classes and possibly create the information in the ObjectMapper based on the schema.

Ultimately, it would be nice to have a singular representation for arrays in PyNWB (rather than allowing lists, h5py.Dataset, numpy etc.) so that this sort of thing could be handled consistently on a single array level. However, creating such an array representation to rule them all is a major undertaking in itself and a rabbit whole that I don't think we want to go down just yet.

NeurodataWithoutBorders / nwb-schema

dimension labels ('dims' field in spec) are not used by pynwb #247