NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io
Other
53 stars 16 forks source link

add electrodes as optional argument to ecephys.Clustering #194

Closed bendichter closed 4 years ago

bendichter commented 6 years ago

It would be great if ecephys.Clustering could support an optional electrodes argument that would expect an electrode_table_region object.

tjd2002 commented 6 years ago

If you use the Unit table for unit metadata instead of Clustering, then you can just add a column for electrodes (actually an electrodeTableRegion). For instance, some of our Unit columns look like this:

    # External clustering software gives names for each cluster--we want to preserve these
    nwbf.add_unit_column('cluster_name',  '(str) cluster name from clustering software')
    nwbf.add_unit_column('elec_group',    '(electrodeGroup) nTrode on which spikes were recorded')

    # For tetrode data, this will usually be all channels in the tetrode
    nwbf.add_unit_column('neighborhood',  '(electrodeTableRegion) list of electrodes on which spikes were clustered')

Notably, the datatypes (str, electrodeTableRegion,...) are not enforced by the schema in this approach, but that seems like something that could be addressed separately.

tjd2002 commented 6 years ago

Same logic applies to #111, which requests new fields be added to Clustering.

bendichter commented 5 years ago

@tjd2002 The dev team has had a few conversations about what is the most efficient way to store spike data. It basically comes down to 2 options: index by unit (as is done in the Units table) and index by time (as is done in Clustering). Both of these structures are constant wrt the number of units, so the question is which one of them is more efficient for read/write/analysis, and the answer is it depends on what kind of read/write/analysis you want to do. For simulation output, where you have each process write the spikes of a single neuron, indexing by spikes makes more sense. However if you are streaming data in real time, writing to a time-indexed data format makes more sense because you are simply appending to 2 Datasets. It would be possible to stream to Units if you are charitable/hacky with the class definition and allow for each unit id to appear multiple times in the table, but this is at best a clunky solution. Similarly during analysis if you want all the spike times of a single neuron, indexing by unit O(n_units) is more efficient than indexing by time O(n_total_spikes). On the other hand, if you want all of the spikes in a specific time window, indexing by time O(log n_total_spikes) is better than indexing by unit O(n_total_spikes). I think we settled on that the default should be indexing by Unit. It had the added advantage that it allowed us to more cleanly include additional cell-based data, which allowed us to fold in ClusterWaveforms as well. However I don't really want to entirely get rid of Clustering because I think there are real use-cases for it, particularly if you are trying to stream cell categorization in real time. I think if we are going to keep it, it should have the appropriate meta-data for derived ecephys measures that would allow a user to understand what electrodes this data was recorded from.

tjd2002 commented 5 years ago

However I don't really want to entirely get rid of Clustering because I think there are real use-cases for it, particularly if you are trying to stream cell categorization in real time.

The release notes suggest that at the time of the refactoring the intent was to deprecate Clustering, for a whole bunch of nice reasons. ( https://nwb-schema.readthedocs.io/en/latest/format_release_notes.html#improved-storage-of-unit-based-data), and this is still my preference.

There would be a big penalty to pay in terms of both code complexity and user confusion if we have 2 redundant ways of storing a fundamental type of data like spike times. So I think the bar should be pretty high before we go this route.

writing to a time-indexed data format [like 'Clustering'] makes more sense because you are simply appending to 2 Datasets. It would be possible to stream to Units if you are charitable/hacky with the class definition and allow for each unit id to appear multiple times in the table, but this is at best a clunky solution.

Why can't you just append to each unit's list of spike_times as spikes come in? Admittedly this would be slower than batch writing to a long list of merged spikes, but it doesn't seem clunky per se.

Similarly during analysis if you want all the spike times of a single neuron, indexing by unit O(n_units) is more efficient than indexing by time O(n_total_spikes). On the other hand, if you want all of the spikes in a specific time window, indexing by time O(log n_total_spikes) is better than indexing by unit O(n_total_spikes).

On read, we are always selecting on both units and time windows, with # of units typically << 1,000 and the number of spikes per unit also on the order of 1,000s, so it's not obvious that one approach would be faster than the other, and in general lookup times are not really a bottleneck for us.

Since the argument seems to boil down to whether getting rid of Clustering would impose a large performance hit on the streaming use case, I have 2 main questions: 1) is this a real and important use case, and 2) are there ways to mitigate any performance issues without 'forking' the representation of spikes.

One possible solution would be to rename Clustering to something like 'StreamingUnitSpikeTimes', or something else that makes it clear that it's only really there for a very specialized purpose. and then make it clear that in that case, you would still use a Units table to keep track of unit metadata, including electrodes, etc. (to come belatedly back to the topic of this issue :) )

ajtritt commented 5 years ago

If a data structure is needed for streaming spike times, we should design something for that. Clustering came from 1.x as a schema for storing the results from klustakwik. If data is being produced as a stream, it should be a subclass of TimeSeries, which Clustering is not. So, you can't take advantage of any of the properties of it being a TimeSeries. Also, the name "Clustering" is ambiguous.

Regarding the original request in this issue: I think it makes more sense for neurodata_type to have a reference to Units table. Units is the structure for storing metadata about spiking units, including associations between spiking units and electrodes.

I just submitted a PR with such a type specced out.

tjd2002 commented 5 years ago

I like this. Ideally the main API/query calls should (eventually) be agnostic as to which style is used ‘under the hood’.

E.g. if you just called add_unit_spike_time (or similar), the data will get written correctly. There could be an optional flag that the user could set if they cared which approach was used.

rly commented 4 years ago

Clustering is deprecated.