NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io

new neurodata_type: TTL pulse #306

Open · bendichter opened this issue 4 years ago

bendichter commented 4 years ago

It would be nice to have an intuitive and machine-readable neurodata_type for TTL pulses. Something like:

neurodata_type_def: TTL
neurodata_type_inc: TimeSeries
datasets:
  - name: data
    shape:
      - null
    dims:
      - pulses
    dtype: int
  - name: keys
    shape:
      - null
    dims:
      - unique_labels
    dtype: int
  - name: labels
    shape:
      - null
    dims:
      - unique_labels
    dtype: text

bendichter commented 4 years ago

@rly when you return I'd be interested in your thoughts

t-b commented 4 years ago

Yep, that could be useful. What would you use the labels and keys for?

bendichter commented 4 years ago

The labels would give you a human-readable description of what the events were. The keys would map those labels to values.
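
For illustration (the values and names below are invented, not part of the proposal), the three datasets might hold something like this, with keys[i] described by labels[i]:

    # Hypothetical contents of the proposed datasets
    data   = [4, 4, 8, 4, 16, 8]                # raw integer emitted for each pulse, in time order
    keys   = [4, 8, 16]                         # unique TTL values used in this session
    labels = ["stim_on", "stim_off", "reward"]  # labels[i] is the human-readable meaning of keys[i]

    label_of = dict(zip(keys, labels))          # e.g. label_of[8] == "stim_off"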

t-b commented 4 years ago

@bendichter But couldn't you use epochs for that as well?

ajtritt commented 4 years ago

Sounds like you're describing a CV, which is usually represented with a table. I'd recommend just using a DynamicTable for the mapping, and the data could be stored as a DynamicTableRegion. I'm not sure how that would play with TimeSeries.data, though.
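
For concreteness, a minimal sketch of that idea using the hdmf.common API (the table name, column names, and values here are made up for illustration):

    from hdmf.common import DynamicTable, DynamicTableRegion

    # One row per unique TTL value; the row index acts as the controlled-vocabulary key.
    ttl_labels = DynamicTable(name='ttl_labels',
                              description='controlled vocabulary of TTL pulse types')
    ttl_labels.add_column(name='value', description='integer emitted on the TTL line')
    ttl_labels.add_column(name='label', description='human-readable meaning of the pulse')
    ttl_labels.add_row(value=4, label='stim_on')
    ttl_labels.add_row(value=8, label='stim_off')

    # Each acquired pulse is then a row index into the table above.
    pulse_kind = DynamicTableRegion(name='data',
                                    data=[0, 0, 1, 0, 1],  # one entry per pulse, in time order
                                    description='pulse type of each TTL event',
                                    table=ttl_labels)

How (or whether) that region could double as TimeSeries.data is exactly the open question above.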

bendichter commented 4 years ago

what do you mean by CV?

ajtritt commented 4 years ago

controlled vocabulary.

bendichter commented 4 years ago

Oh, I see. Yeah, very similar, but I think the one difference is that the actual value of the int is important to the labs, as well as the label.

ajtritt commented 4 years ago

That might not be a CV then. Are the values of the pulses enumerable?

bendichter commented 4 years ago

Enumerable? The ints are chosen by the researcher to signal specific events. There is a finite number of them and they generally are not sequential.

rly commented 4 years ago

Although TTL pulses are a sequence of events in time, I do not think inheriting this type from TimeSeries is the most appropriate choice here. TTL pulses are not generic time series measurements (the data don't have units), and the timestamps are almost never uniformly sampled in time. It would be more appropriate to use or alias the LabeledEvents type (https://github.com/NeurodataWithoutBorders/nwb-schema/pull/302/files), which is similar to what you have here ("keys" = "labels", "labels" = "label_map").

In my experience, users sometimes work with all of the TTL pulses together in order of time, but more often they separate the pulses by value and work with only a subset of them, often in isolation (peri-stimulus time histogram), and sometimes together in order. Those use cases would be best served by two different storage options:

  1. As @bendichter described, store a single array of N (timestamp, value) pairs.
  2. Store K 1-D arrays of timestamps, where K is the number of unique pulse values.

For Option 1, if the user wants to get the data in the form of Option 2 (i.e. split the array by value into K 1-D arrays), a naive implementation that filters the full array once per value takes O(NK) time; I'm not sure this can be done faster.
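
A rough sketch of that naive split, assuming numpy arrays and invented example values (each mask comparison scans all N pairs, once per unique value):

    import numpy as np

    # Option 1 layout: N (timestamp, value) pairs
    timestamps = np.array([0.10, 0.52, 0.90, 1.35, 2.01])
    values     = np.array([4,    8,    4,    16,   8])

    # Split into K 1-D timestamp arrays, one per unique value: O(N*K) overall
    per_value = {v: timestamps[values == v] for v in np.unique(values)}
    # per_value[4] -> array([0.10, 0.90])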

For Option 2, if the user wants to get the data in the form of Option 1 (i.e. create a single array of all of the pulse values in order of time), then they would have to do a K-way merge sort of the K 1-D arrays, which is O(N log K) time.
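
A sketch of that K-way merge with the standard library, continuing the invented example above:

    import heapq

    # Option 2 layout: K sorted 1-D timestamp arrays keyed by pulse value
    per_value = {4: [0.10, 0.90], 8: [0.52, 2.01], 16: [1.35]}

    # Rebuild the single time-ordered stream of (timestamp, value) pairs: O(N log K)
    merged = heapq.merge(*(((t, v) for t in times) for v, times in per_value.items()))
    pairs = list(merged)  # [(0.10, 4), (0.52, 8), (0.90, 4), (1.35, 16), (2.01, 8)]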

Given that difference in efficiency (at the limit) and that I think working with only one or a subset of pulse values is more common than working with all pulse values in order, I think Option 2 would be better.

ajtritt commented 4 years ago

  2. Store K 1-D arrays of timestamps, where K is the number of unique pulse values.

Could we store this in the same way we store spike units?

rly commented 4 years ago

Yes, a DynamicTable would work, but for some reason, it feels less intuitive to me (same for the units table). It's a little overkill, but it would organize the TTL pulses together. Granted, the user does not need to know (and to some degree should not need to know) the low-level details of how the data is structured.

Ultimately, I think that TTL pulses, spike unit times, and events in general, could, and probably should, be stored in the same way as each other. So, yes, option 3:

  3. Store TTL pulses as a DynamicTable with an indexed column pulse_times and a normal column label (sketched below).
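
A rough sketch of that option with hdmf.common.DynamicTable (names and values invented for illustration; an indexed column gives one ragged list of times per row):

    from hdmf.common import DynamicTable

    # One row per unique TTL pulse type, with a ragged (indexed) column of times.
    ttl = DynamicTable(name='ttl_pulses', description='TTL pulses grouped by value')
    ttl.add_column(name='label', description='human-readable meaning of the pulse')
    ttl.add_column(name='pulse_times', description='times of each pulse of this type',
                   index=True)
    ttl.add_row(label='stim_on',  pulse_times=[0.10, 0.90])
    ttl.add_row(label='stim_off', pulse_times=[0.52, 2.01])
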
bendichter commented 4 years ago

I worry about the proliferating use of DynamicTables, because it sort of sidesteps the whole point of NWB:N, which is to provide good standards for metadata. DynamicTable is basically an open door that allows you to store any type of data you want with minimal constraints. Any open door will lead to the temptation to use it for every datatype that is not currently supported, and I worry about the long-term effects of that. This is useful for purely metadata structures like trials, where this flexibility is important and adherence to a strict standard is less important, but I worry when they are used for annotating acquired data. When they are used to annotate data, DynamicTables on their own are invariably insufficient for the chosen type of data and we end up having to compensate with hacks. Even with Units there is an issue with storing the sampling rate of the waveforms and with linking individual spikes to snippets. The only way to resolve this flexibly would be to allow DTs to have global attributes, and at that point they would not enforce any structure at all!

I agree that we need a way to express time series objects that are not measurements and are unitless. I think the proposed Events and/or LabeledEvents types address this problem nicely, and I like that they generalize to cases outside of TTL pulses, because this has come up a few times and will continue to be needed. The one issue I have with the proposed extension is the names of the datasets. To me, label should be a string, i.e. the value of the dictionary, not the key.

Then there is the question of whether we should store one object that captures all TTL pulses or several objects that each store one TTL pulse event type. This comes down to whether we want to make the data closer to what is acquired or closer to the semantic meaning/how it will be used in analysis. In this case I vote to make it closer to what is acquired, with LabeledEvents, because I imagine TTL pulses going in /acquisition. If users want to pull out specific events, we could encourage them to do that with Events objects in a processing module.

rly commented 4 years ago

@bendichter Coming back to this, when you wrote:

When they are used to annotate data, DynamicTables on their own are invariably insufficient for the chosen type of data and we end up having to compensate with hacks. Even with Units there is an issue with storing the sampling rate of the waveforms and with linking individual spikes to snippets. The only way to resolve this flexibly would be to allow DTs to have global attributes, and at that point they would not enforce any structure at all!

What do you mean by global attributes? I think it is fine for subtypes of DynamicTables to define attributes that apply to all rows of the table.
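
For example, a subtype could declare such a table-level attribute in its spec. A minimal sketch with pynwb.spec (the type name TTLTable and the sampling_rate attribute are hypothetical, used only to show the mechanism):

    from pynwb.spec import NWBGroupSpec, NWBAttributeSpec

    # Hypothetical DynamicTable subtype with an attribute that applies to all rows
    ttl_table_spec = NWBGroupSpec(
        neurodata_type_def='TTLTable',
        neurodata_type_inc='DynamicTable',
        doc='TTL pulses grouped by value, one row per pulse type',
        attributes=[
            NWBAttributeSpec(name='sampling_rate',
                             doc='sampling rate of the digital line, in Hz',
                             dtype='float',
                             required=False),
        ],
    )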

rly commented 4 years ago

What is the issue with linking individual spikes to snippets? I saw a bunch of issues/PRs on UnitSeries, and it looked like a reasonable solution was agreed upon in https://github.com/NeurodataWithoutBorders/nwb-schema/issues/248