NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

dim scales: Enhance specification language and containers #626

Open oruebel opened 5 years ago

oruebel commented 5 years ago

1) Feature Request

Problem: Currently, the schema does not support directly associating datasets with the dimensions of another dataset.

Proposed Solution

Part 1: Enhance the specification language. Enhance the dims key in the specification language so that, in addition to a string labeling a dimension, it can accept a dict of the form:

dims:
  - - label: time
      scale: timestamps
    - label: electrodes
      scale: electrodes
  - ...

Here the scale key would refer to a dataset with a fixed name located within the same Group. In this way, we do not need to add complex cross-referencing mechanisms to the schema, and requiring that scales live in the same location as the data they label seems appropriate. The scale key should be allowed to be a list, to enable multiple scales for a single axis. In addition, by using a reference to a table as a dimension scale, one can encode complex annotations and labels in a scale.

Part 2: Enhance the container classes. In the API, a user should be able to request a list of all the scales for a particular dataset. For primary data containers (e.g., TimeSeries) this could be a function on the container class (e.g., get_scales or dims) that would return an OrderedDict of dicts, e.g., {'time': {'timestamps': timestamps, ...}, ...}.
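As a rough sketch of the proposed accessor (none of these class or method names exist in pynwb; they are hypothetical, chosen only to illustrate the OrderedDict-of-dicts return shape):

```python
from collections import OrderedDict

# Hypothetical sketch only -- not pynwb API. Illustrates the proposed
# get_scales() return shape: an OrderedDict mapping dimension labels to
# dicts of {scale_name: scale_data}.
class ScaledContainer:
    def __init__(self):
        self._scales = OrderedDict()

    def add_scale(self, dim_label, scale_name, scale_data):
        # register scale_data as one of possibly several scales for dim_label
        self._scales.setdefault(dim_label, {})[scale_name] = scale_data

    def get_scales(self):
        # e.g. {'time': {'timestamps': [...]}, 'electrodes': {...}}
        return self._scales


series = ScaledContainer()
series.add_scale('time', 'timestamps', [0.0, 0.5, 1.0])
series.add_scale('electrodes', 'electrodes', [3, 7, 11, 12])
scales = series.get_scales()
```

A tool could then iterate over scales to, e.g., pick axis labels for a plot without knowing the container type.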

Problem/Use Case

For visualization, analysis, and query, it is useful to be able to programmatically inspect the dimensions of an array in a consistent fashion. In this way, we can, e.g., automatically label axes in plots or query the dimensions of an array without having to know the specifics of individual containers.


bendichter commented 5 years ago

If we are not going to yoke label with scale (which I'm fine with), I'd actually rather this go in 'shape'. I propose specifying it like this:

shape:
    - 2
    - electrodes

Note that the current syntax of

shape:
    - 2
    - null

is still valid. "null" (or blank) is a special value meaning "no scale matching." For all other non-int strings, dimensions of different datasets whose shape entries use the same string must match in size.
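A minimal sketch of that matching rule (function and variable names here are made up for illustration, not proposed pynwb API): integer entries fix a size, None means unconstrained, and equal non-int strings must resolve to equal sizes.

```python
# Hypothetical sketch of the proposed shape-matching rule: within a group,
# dimensions whose shape entry is the same non-int string must have equal
# lengths; None ("null") means no matching constraint.
def check_shapes(specs, shapes):
    """specs: {name: shape spec list}; shapes: {name: actual shape tuple}."""
    sizes = {}  # symbolic dimension name -> observed size
    for name, spec in specs.items():
        for entry, actual in zip(spec, shapes[name]):
            if isinstance(entry, str):  # symbolic size, e.g. 'electrodes'
                if entry in sizes and sizes[entry] != actual:
                    return False
                sizes[entry] = actual
            elif entry is not None and entry != actual:  # fixed int size
                return False
    return True


specs = {'a': [2, 'electrodes'], 'b': ['electrodes', None]}
ok = check_shapes(specs, {'a': (2, 32), 'b': (32, 7)})   # sizes agree
bad = check_shapes(specs, {'a': (2, 32), 'b': (16, 7)})  # 'electrodes' differs
```

Because the symbolic names are resolved per group, this kind of check could run in docval without any file-wide bookkeeping.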

How broad of a scope do we want this to be enforced? I think it should be enforced within a Group but not for the entire file. That will allow us to build this into docval, and I am worried that a file-wide scope will cause problems.

oruebel commented 5 years ago

Just to clarify, this is an issue for 2.x and not something we are planning to do for the release. I only created the issue because it came up as a need for us, and I wanted to create a record so I don't forget.

How broad of a scope do we want this to be enforced?

This should be limited to the scope of a Group/Dataset. This is partly to avoid having to support complex referencing of objects across the whole schema, and partly because, for the purpose of scales, group-level scope is sufficient.

If we are not going to yoke label ...

The label is to name the dimension (not the scale); the name of the scale should come from the scale dataset itself. Also, scale needs to be a list or dict in order to allow multiple scales to be associated with the same dimension.

I'd actually rather this go in 'shape'.

The main purpose of this is to have a formal mechanism for connecting a dataset as a scale to the dimension of another dataset. I don't think shape is the right place for this: shape defines the size of the dataset, whereas dims describes the dimensions, which is what scales do.

For all other non-int strings, dimensions of different datasets with matching string shape parameters must match.

The issue of enforcing matching shapes between datasets is, to me, a different (and larger) problem.

  1. There are cases where two datasets should have matching shapes but are not scales (e.g., in the context of matrix factorizations, dimensions of the matrices must match but they are not scales).
  2. Matching shapes is a runtime problem, i.e., you often don't know the exact length; you only know what needs to match.
  3. To match shapes we need to know not just the datasets but also which dimensions need to match, and in what way. E.g., you may have two datasets, one m x n and the other n x m (i.e., the ordering of the dimensions that need to match changes), or you may have an m x m dataset (i.e., now we need to describe matching shape within a single dataset). I don't think scales are the right mechanism to express this.

Ultimately, I think we need a different mechanism to describe relationships between datasets in the schema. I agree that this is an important issue we should look at, but I don't think we should mix it in with the issue of scales.

bendichter commented 5 years ago

Ah, OK, I was confused about what problem this was solving. Having scales is a good idea. In TimeSeries, would timestamps be a scale of data?

For size matching, I think the solution I proposed above solves issue (3), but if we want to talk about that let's move it to another issue.

oruebel commented 5 years ago

In TimeSeries, would timestamps be a scale of data?

Yes

bendichter commented 5 years ago

gotcha, OK. I think using scales will help us be more explicit about the data. Was there a specific motivating example that inspired this issue?

oruebel commented 5 years ago

Was there a specific motivating example that inspired this issue?

The motivation for this was the need for tools on top of PyNWB to be able to figure out dimensions programmatically.

tjd2002 commented 5 years ago

As we start to build query/analysis tools we are running into this need more explicitly.

Just as an example, for columnar TimeSeries data (like a SpatialSeries), it sure would be nice to refer to the data columns as ['pos_x'] and ['pos_y'] rather than [0] and [1].

Currently this ordering must be inferred from the docs for pynwb.behavior.Position, which read: "Position data, whether along the x, x/y or x/y/z axis."


# Store some position (x, y) data:
pos = pynwb.behavior.Position(spatial_series=[],
                              name='Position')

pos.create_spatial_series(name='Position d4',
                          timestamps=timestamps,
                          data=my_pos_dataframe[['x', 'y']] * m_per_pixel,
                          reference_frame='top left corner of video frame',
                          unit='m')

Then, on read, we manually re-label the columns:

position_nwb = nwbf.modules['Behavior']['Position']['Position d4']
position_dataframe = pd.DataFrame(data=position_nwb.data[()], columns=['pos_x', 'pos_y'])

Facilities for labeling columns already exist in hdf5 (and xarray, pandas, ...); it seems it should be straightforward to implement once we decide on the right way to express this in the spec.
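For reference, h5py already exposes HDF5's built-in dimension-scale machinery, so the storage side largely exists; a minimal sketch (file and dataset names here are illustrative, not an NWB layout):

```python
import h5py
import numpy as np

# Demo of HDF5 dimension labels and dimension scales via h5py.
with h5py.File('dims_demo.h5', 'w') as f:
    pos = f.create_dataset('position', data=np.zeros((100, 2)))
    ts = f.create_dataset('timestamps', data=np.linspace(0.0, 10.0, 100))

    ts.make_scale('timestamps')   # mark the dataset as a dimension scale
    pos.dims[0].attach_scale(ts)  # attach it to axis 0 of 'position'
    pos.dims[0].label = 'time'    # plain string dimension labels
    pos.dims[1].label = 'pos'

# On read, both the labels and the attached scales can be recovered:
with h5py.File('dims_demo.h5', 'r') as f:
    pos = f['position']
    labels = [dim.label for dim in pos.dims]
    scale_len = pos.dims[0]['timestamps'].shape[0]
```

So the open question is mainly how to express this in the NWB spec, not whether the file format can store it.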

tjd2002 commented 5 years ago

(Arguably, for tabular/columnar data, we should just use a pynwb.DynamicTable, which already captures a name and free-text description for each column...)

bendichter commented 5 years ago

I don't know about the prolific use of DTs, but I agree explicit labels for dimensions in pynwb would be really nice. This would require changes to the schema, pynwb, and matnwb, so it's a decent chunk of work. Maybe we can add this to the HCK06 potential project list for developers.

tjd2002 commented 5 years ago

@bendichter Agreed about this being a good hackathon project; good idea.

explicit labels for dimensions in pynwb would be really nice

Let's be careful to distinguish 'labels for dimensions' (string dimension labels already exist in the spec, though they don't seem to get written into the .nwb file) from Oliver's proposal here, which is to provide 'dimension scales'.

xarray calls these two concepts 'dims' and 'coords', respectively: http://xarray.pydata.org/en/stable/data-structures.html

hdf5 calls them 'dimension labels' and 'dimension scales': http://docs.h5py.org/en/stable/high/dims.html
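To illustrate the xarray terminology (array contents here are toy data): 'dims' name the axes, while 'coords' attach scale arrays to named axes.

```python
import numpy as np
import xarray as xr

data = np.zeros((4, 3))
timestamps = np.array([0.0, 0.5, 1.0, 1.5])

arr = xr.DataArray(
    data,
    dims=['time', 'electrodes'],  # dimension labels ('dims')
    coords={'time': timestamps},  # dimension scale ('coord') for 'time'
)

# Axes can now be addressed by name rather than position:
mean_over_time = arr.mean(dim='time')
```

This is the same split Oliver's proposal makes between the label key and the scale key.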

oruebel commented 5 years ago

@tjd2002 good point. Yes, dimension labels we can already get from the current schema; the information about scales, unfortunately, not yet. If you are just interested in the labels, then I think we should put that in a separate issue to avoid confusion in the discussions, i.e., use this issue for scales and the other issue for labels.

tjd2002 commented 5 years ago

See issue #816 for dim labels
