oruebel opened this issue 5 years ago
If we are not going to yoke label with scale (which I'm fine with), I'd actually rather this go in shape. I propose that you specify this as:
shape:
- 2
- electrodes
Note that the current syntax of
shape:
- 2
- null
is still valid. "null" (or blank) will be a special string that means "no scale matching." For all other non-int strings, dimensions of different datasets with matching string shape parameters must match.
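The matching rule described above could be sketched as a small validator. This is hypothetical code, not part of any existing tool; the dataset names ('voltages', 'positions') and the dimension name ('electrodes') are made up for illustration, and YAML's null maps to Python's None:

```python
def check_shapes(specs, actual_shapes):
    """specs: dataset name -> shape spec (list of int, None, or str).
    actual_shapes: dataset name -> actual shape tuple of ints.
    An int fixes the size, None means "no scale matching", and a string
    names a shared dimension whose sizes must agree across datasets."""
    named_sizes = {}  # dimension name -> size seen so far
    for name, spec in specs.items():
        shape = actual_shapes[name]
        if len(spec) != len(shape):
            return False
        for dim_spec, size in zip(spec, shape):
            if isinstance(dim_spec, int):
                if dim_spec != size:
                    return False
            elif isinstance(dim_spec, str):  # e.g. 'electrodes'
                if named_sizes.setdefault(dim_spec, size) != size:
                    return False
            # None: no constraint on this dimension
    return True

specs = {'voltages': [None, 'electrodes'], 'positions': ['electrodes', 2]}
ok = check_shapes(specs, {'voltages': (1000, 32), 'positions': (32, 2)})   # True
bad = check_shapes(specs, {'voltages': (1000, 32), 'positions': (16, 2)})  # False
```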
How broad of a scope do we want this to be enforced? I think it should be enforced within a Group but not for the entire file. That will allow us to build this into docval, and I am worried that a file-wide scope will cause problems.
Just to clarify, this is an issue for 2.x and not something we are planning to do for the release. I only created the issue because it came up as a need for us and I want to create a record so I don't forget.
How broad of a scope do we want this to be enforced?
This should be limited to the scope of a Group/Dataset. This is in part to avoid having to support complex referencing of objects across the whole schema; also, for the purpose of "scales", allowing group-level scope only is sufficient.
If we are not going to yoke label ...
The label is to name the dimension (not the scale). The name of the scale should come from the scale dataset. Also, scale needs to be a list or dict, in order to allow multiple scales to be associated with the same dimension.
I'd actually rather this go in 'shape'.
The main purpose of this is to have a formal mechanism to connect a dataset as a scale to the dimension of a dataset. I don't think shape is the right place for this: shape defines the size of the dataset, whereas dims is for describing the dimensions, which is what scales do.
For all other non-int strings, dimensions of different datasets with matching string shape parameters must match.
The issue of enforcing matching shapes between datasets to me is a different (and larger) problem. For example, you may have a case where one dataset is m x n and the other is n x m (i.e., the ordering of the dimensions that need to match changes), or you may have an m x m dataset (i.e., now we need to describe matching shapes within a dataset). I don't think scales are the right mechanism to express this. Ultimately, I think we need a different mechanism to describe relationships between datasets in the schema. I agree that this is an important issue that we should look at, but I don't think we should mix it in with the issue of scales.
ah ok, I was confused about what problem this was solving. Having scales is a good idea. In TimeSeries, would timestamps be a scale of data?
For size matching, I think the solution I proposed above solves issue (3), but if we want to talk about that let's move it to another issue.
In TimeSeries, would timestamps be a scale of data?
Yes
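(As an aside, this is exactly xarray's notion of a coordinate attached to a named dimension; an illustrative sketch with made-up data, not actual NWB code:)

```python
import numpy as np
import xarray as xr

timestamps = np.arange(5) * 0.1   # seconds; plays the role of a scale
data = np.random.rand(5, 3)       # 5 time points x 3 channels

# 'time' names the dimension; the timestamps array is attached to it
# as a coordinate, i.e. a scale for that axis.
series = xr.DataArray(data, dims=['time', 'channel'],
                      coords={'time': timestamps})
```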
gotcha, OK. I think using scales will help us be more explicit about the data. Was there a specific motivating example that inspired this issue?
Was there a specific motivating example that inspired this issue?
The motivation for this was the need for tools on top of PyNWB to be able to figure out dimensions programmatically.
As we start to build query/analysis tools we are running into this need more explicitly.
Just as an example, for columnar TimeSeries data (like a SpatialSeries), it sure would be nice to refer to the data columns as ['pos_x'] and ['pos_y'] rather than [0] and [1].
Currently this ordering must be inferred from the docs for pynwb.behavior.Position, which read: "Position data, whether along the x, x/y or x/y/z axis."
# Store some position (x,y) data:
pos = pynwb.behavior.Position(spatial_series=[], name='Position')
pos.create_spatial_series(name='Position d4',
                          timestamps=timestamps,
                          data=my_pos_dataframe[['x', 'y']] * m_per_pixel,
                          reference_frame='top left corner of video frame',
                          unit='m')
Then on read, we manually re-label the columns
position_nwb = nwbf.modules['Behavior']['Position']['Position d4']
position_dataframe = pd.DataFrame(data=position_nwb.data[()], columns=['pos_x', 'pos_y'])
Facilities for labeling columns already exist in hdf5 (and xarray, pandas, ...); it seems it should be straightforward to implement once we decide on the right way to express things in the spec.
(Arguably, for tabular/columnar data, we should just use a pynwb.DynamicTable, which already captures a name and free-text description for each column...)
I don't know about the prolific use of DTs, but I agree explicit labels for dimensions in pynwb would be really nice. This would require changes to the schema, pynwb, and matnwb, so it's a decent chunk of work. Maybe we can add this to the HCK06 potential project list for developers.
@bendichter Agreed about this being a good hackathon project; good idea.
explicit labels for dimensions in pynwb would be really nice
Let's be careful to distinguish 'labels for dimensions' (string dimension labels already exist in the spec, though they don't seem to get written into the .nwb file), from Oliver's proposal here, which is to provide 'dimension scales'
xarray calls these two concepts 'dims' and 'coords', respectively: http://xarray.pydata.org/en/stable/data-structures.html hdf5 calls them 'dimension labels' and 'dimension scales': http://docs.h5py.org/en/stable/high/dims.html
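A minimal h5py sketch of that distinction (the file path and dataset names are throwaway, for illustration only):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with h5py.File(path, 'w') as f:
    data = f.create_dataset('data', data=np.random.rand(5, 3))
    ts = f.create_dataset('timestamps', data=np.arange(5) * 0.1)

    # dimension *label*: just a string naming the axis
    data.dims[0].label = 'time'

    # dimension *scale*: another dataset attached to that axis
    ts.make_scale('timestamps')
    data.dims[0].attach_scale(ts)
```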
@tjd2002 good point. Yes, dimension labels we could already get from the current schema; the information about scales, unfortunately, not yet. If you are just interested in the labels, then I think we should probably put that in a separate issue to avoid confusion in the discussions. I.e., use this issue for the scales and the other issue for labels.
See issue #816 for dim labels
1) Feature Request
Problem: Currently the schema does not support directly associating a dataset with the dimensions of another dataset.
Proposed Solution
Part 1: Enhance the specification language
Enhance the dims key in the specification language to not just allow a string to label a dimension but to also allow a dict. Here the scale key would refer to a dataset with a fixed name located within the same Group. In this way, we do not need to add complex cross-referencing mechanisms to the schema, and requiring that scales be in the same location as the data they label seems appropriate. The scale key should be allowed to be a list to enable multiple scales for a single axis. Also, by using a reference to a table as a dimension scale, one can encode complex annotations and labels in a scale.
Part 2: Enhance the container classes
In the API, a user should be able to request a list of all the scales for a particular dataset. For primary data containers (e.g., TimeSeries) this could be a function on the container class (e.g., get_scales or dims) that would return an OrderedDict of dicts, e.g. {'time': {'timestamps': timestamps, ... }, ...}.
Problem/Use Case
For visualization, analysis, and query, it is useful to be able to programmatically inspect the dimensions of an array in a consistent fashion. In this way we can, e.g., automatically label axes in plots or query the dimensions of an array without having to know the specifics of containers.
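To make the proposal concrete, a hypothetical sketch in Python. The label/scale keys of the spec fragment and the return shape of get_scales are illustrative only, not an actual spec or pynwb API; the dimension and dataset names are made up:

```python
from collections import OrderedDict

# Part 1 (hypothetical spec fragment): each dims entry may be a dict
# with a 'label' for the dimension and one or more 'scale' dataset
# names that live in the same Group as the data.
dims_spec = [
    {'label': 'time', 'scale': 'timestamps'},
    {'label': 'electrode', 'scale': ['electrode_ids', 'electrode_xyz']},
]

# Part 2 (hypothetical API result): a get_scales()/dims() method on a
# container could resolve those names into an OrderedDict of dicts.
timestamps = [0.0, 0.1, 0.2, 0.3]
scales = OrderedDict([
    ('time', {'timestamps': timestamps}),
    ('electrode', {'electrode_ids': [0, 1, 2]}),
])

# Use case: tools can now label plot axes without knowing the
# specifics of the container type.
axis_labels = list(scales.keys())
```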