G-Node / nix

Neuroscience information exchange format
https://readthedocs.org/projects/nixio/

Concepts for "identified" dimensions #522

Open gicmo opened 9 years ago

gicmo commented 9 years ago

Currently there is no way to express that two dimensions are identical, i.e. that they describe the very same time (as in experimental time), the same spatial region, and so on. One way to achieve this is of course shared dimensions (cf. issue #519). But this is not sufficient if, for example, the same time axis is regularly sampled (SampledDimension) for voltage traces and at the same time (heh!) irregularly sampled (RangeDimension) for spike trains. Having a way to find identical (identified?) dimensions would allow us to, e.g., automatically group data together for plotting. It would also help with dimension constraints for groups (cf. #521) if we choose to implement them.
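To make the grouping idea concrete, here is a minimal sketch in plain Python. It does not use the NIX API; the `Dim` class, its `axis_id` field, and `group_by_axis` are hypothetical names, and the identity label is simply assumed to be a string attached to each dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    """Hypothetical stand-in for a dimension descriptor (not a NIX class)."""
    array_name: str  # data array the dimension belongs to
    kind: str        # "sampled" or "range"
    axis_id: str     # assumed identity label, e.g. "experiment-time"


def group_by_axis(dims):
    """Collect the names of all arrays whose dimensions share an axis_id."""
    groups = defaultdict(list)
    for dim in dims:
        groups[dim.axis_id].append(dim.array_name)
    return dict(groups)


dims = [
    Dim("voltage", "sampled", "experiment-time"),  # regularly sampled trace
    Dim("spikes", "range", "experiment-time"),     # irregular spike times
    Dim("image", "sampled", "x-position"),         # unrelated spatial axis
]
groups = group_by_axis(dims)
```

With such a label, a plotting tool could put "voltage" and "spikes" on one time axis even though one dimension is sampled and the other is a range.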

matham commented 7 years ago

This is something that would be pretty useful. Initially, before looking more closely at what tags actually are, I thought this would be the job of tags. But then I saw that tags are more concerned with pointing to indices, i.e. tagging them.

What do you think about "reverse tags"? That is, instead of allowing only plain dimensions, allow a dimension to be a tag that points to another dimension. Imagine we have something like what you describe: a SampledDimension and a few RangeDimensions. Then instead of a RangeDimension we'd create a tag that acts like a dimension and refers to certain time points of the SampledDimension (using position_at https://github.com/G-Node/nixpy/blob/master/nixio/pycore/dimensions.py#L67 internally), and add that as a dimension.

So instead of doing tag.references.add(data), you'd do e.g. data.append_range_dimension(tag), and the tag dimension would internally be added as a ref to the tag, which is why I called it reverse tagging. From an API point of view I think this makes sense, although I'm not sure how practical it is given the way the h5 files are structured internally.

jgrewe commented 7 years ago

Hi @matham, thanks for your comments and the push you are giving us to think about these issues again. This is an interesting idea but I'm not sure if I understand the reverse tag correctly.

If I get it right, the situation would be like this:

DataArray1 - some data
--SampledDimension

DataArray2 - some events in DataArray1
--RangeDimension (tag)

Tag - for linking DataArray1 and DataArray2
--positions DataArray2
--references DataArray1

This way we could express that the times in DataArray1 and DataArray2 are the same, or rather that those in DataArray2 are a selection of those in DataArray1. DataArray2 would then contain indices into the times of DataArray1's SampledDimension?

In the "classical" way we would have something like this.

DataArray1 - some data
--SampledDimension

DataArray2 - some event times
--AliasRangeDimension() - indicates that the content of DataArray2 itself can be understood as a dim descriptor

Tag - actually a MultiTag to link the positions in DataArray2 to the respective times in DataArray1
--positions -> DataArray2
--references -> DataArray1

matham commented 7 years ago

I guess the situation I had in mind was a bit different than what you described.

Imagine a setup where you save several associated signals: you record an electrode voltage at some sampling rate and the (integer) state of some stimulus, e.g. light intensity, at the same sampling rate. You also record the flow rate of a pump whenever it changes, and at each of those changes you record the temperature of something as well.

In the simplest way you'd create 4 data sets:

ElectrodeVoltageArray  # floats32
--SampledDimension1

LightLevelArray  # ints
--SampledDimension2

PumpArray  # floats64
--RangeDimension1

TempArray  # floats32
--RangeDimension2

The basic problem is that the SampledDimension is stored twice with identical values. Similarly, the RangeDimension is stored twice with identical values. For very large recordings, where you may record, say, 10 things with different data types but the same RangeDimension values, you'd end up with a lot of wasted space.
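As a rough back-of-the-envelope illustration of that waste, the arithmetic can be written out as below. The tick count is an assumed, made-up number, and ticks are assumed to be stored as 8-byte floats.

```python
n_ticks = 1_000_000   # assumed number of irregular time stamps
bytes_per_tick = 8    # assumed float64 storage per tick
n_arrays = 10         # arrays that share the same RangeDimension values

duplicated = n_arrays * n_ticks * bytes_per_tick  # every array keeps its own copy
shared = n_ticks * bytes_per_tick                 # one copy referenced by all arrays
wasted = duplicated - shared

print(f"{wasted / 1e6:.0f} MB of redundant tick values")
```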

The simplest solution is to allow creating proxies or refs to a dimension which could be used as a dimension. In which case the example above would become:

ElectrodeVoltageArray  # floats32
--SampledDimension1

LightLevelArray  # ints
--RefSampledDimension  # to ElectrodeVoltageArray's SampledDimension1

PumpArray  # floats64
--RangeDimension1

TempArray  # floats32
--RefRangeDimension  # to PumpArray's RangeDimension1
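A minimal sketch of what such a proxy could look like, in plain Python rather than the NIX API. `RangeDim` and `RefDim` are hypothetical classes invented for illustration; the only point is that the proxy delegates to the referenced dimension instead of copying its values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RangeDim:
    """Hypothetical dimension that owns its tick values."""
    ticks: List[float]


@dataclass
class RefDim:
    """Hypothetical proxy that stores only a reference to another dimension."""
    target: RangeDim

    @property
    def ticks(self) -> List[float]:
        # Delegate to the referenced dimension instead of copying its values.
        return self.target.ticks


pump_dim = RangeDim(ticks=[0.0, 1.5, 4.2])  # PumpArray's RangeDimension1
temp_dim = RefDim(target=pump_dim)          # TempArray's RefRangeDimension

pump_dim.ticks.append(7.9)  # a new flow-rate change is recorded once
```

Because `temp_dim` holds no values of its own, the appended tick is visible through both dimensions and the two can never drift apart.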

The example I gave in the previous post dealt with a somewhat different problem, although I'm not too sure it's very useful. As far as I understand it, (multi)tags are conceptually events indicating something that happened at certain positions (times), which are stored in positions, like the nose in an image or drug delivery times in e-phys. Actually, nixpy doesn't seem to mention an AliasRangeDimension, but I assume the event times come from the times associated with the positions values, although I'm not completely sure how that works for things that do and don't have the concept of time. You also seem to say that the positions are actual time values, but I thought they were indices and that the times come from looking up those indices in the data in references.

What I had in mind was to add tagging of the time, or the dimension, itself. That is, rather than tagging the positions, we tag the times. Practically, references would store a dimension instance, positions would store indices that point into that dimension instance, and the resulting tag would itself be a dimension type. This would primarily be useful when you want to tag a dataset but also provide a value for each of the tag positions. Imagine we have an electrode recording as well as the name of the valve that was turned ON at specific times, using the electrode time dimension as the common clock. It might look something like:

ElectrodeVoltageArray
--SampledDimension1

ValveArray
--TaggedRangeDimension
----positions -> list of indices into SampledDimension1
----references -> SampledDimension1 (or maybe even ElectrodeVoltageArray)

The difference is that now each tag position can also take a value which you cannot do now with tags. Essentially this will convert ordinary arrays into potential tags although maybe tags is not the right word for this. But as I said, I'm not really sure this part is a good idea or super useful.
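A minimal sketch of the TaggedRangeDimension idea in plain Python, not the NIX API. `SampledDim` and `TaggedRangeDim` are hypothetical classes; `position_at` here is a simplified stand-in that assumes a sampled axis is fully described by an offset and a sampling interval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampledDim:
    """Hypothetical regularly sampled axis: time = offset + index * interval."""
    sampling_interval: float
    offset: float = 0.0

    def position_at(self, index: int) -> float:
        return self.offset + index * self.sampling_interval


@dataclass
class TaggedRangeDim:
    """Hypothetical dimension defined by indices into another dimension."""
    positions: List[int]    # indices into the referenced dimension
    references: SampledDim  # the dimension acting as the common clock

    @property
    def ticks(self) -> List[float]:
        # Times are looked up in the referenced dimension, never stored here.
        return [self.references.position_at(i) for i in self.positions]


electrode_time = SampledDim(sampling_interval=0.5)  # SampledDimension1
valve_names = ["A", "B", "A"]                       # values held by ValveArray
valve_dim = TaggedRangeDim(positions=[2, 8, 20], references=electrode_time)

# Each tagged position carries a value and a time taken from the electrode clock.
events = list(zip(valve_dim.ticks, valve_names))
```

The valve events carry their own values while their times stay tied to the electrode recording's clock.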