cortex-lab / phy

phy: interactive visualization and manual spike sorting of large-scale ephys data
BSD 3-Clause "New" or "Revised" License

Improve default dimensions in feature grid view #360

Closed rossant closed 9 years ago

rossant commented 9 years ago

@nsteinme what should be the x and y dimensions in subplot (i, j)? Previously, x=dim_j, y=dim_i, now this is reversed. I can find arguments for both ways. WDYT?

Also, maybe we could show different dimensions in (i, j) and (j, i) (currently, the plots are just symmetric)

nsteinme commented 9 years ago

Hm. Definitely it makes sense to show different dimensions in (i,j) vs (j,i) rather than just mirroring. Are you saying that it currently picks just one channel, and shows you PC1, PC2, and PC3 from that channel in column 1, 2, and 3? And the same for a different channel in the rows?

Well, I don't want to claim I'm certain I have the best answer. But here's how I'd do it if I were doing it from scratch:

(1,1) = time vs. depth (where depth is a feature computed for each spike - did I request this as a feature? I think it's a really good idea).

Then I would keep the x axis of all first-row and first-column plots as Time (like it currently is, but keeping time on x rather than y, which I think is much easier to look at).

Columns 2, 3, 4 = channels x, y, and z, where those are the best channels for the first selected cluster.

Then you plot PC1, PC1, PC2, and PC3 for the four rows in that column (all on the y-axes).

Rows 2, 3, 4 = channels m, n, p, where those are the best channels for the second selected cluster. Then PC1, PC1, PC2, and PC3 in the four columns of that row (all on the x-axes).

So for example, plot (3, 1) is channel n PC1 vs Time. Plot (3, 2) is channel n PC1 [again] vs channel x PC2. Plot (4, 4) is channel p PC3 vs channel z PC3. Etc.

This gives you 18 different features to look at, plus six different PC1-vs-Time plots.

It could know to pick a different channel for m, n, or p if those were already selected in x, y, and z.

Just one idea! Just thinking how to maximize information there.

There could be another "mode" that would be very useful in which all plots were PC1 vs Time, and the plots were arranged by their spatial layout on the probe. So plot(2,1) is the channel physically just above plot(3,1), and you're seeing PC1 vs Time for each. This would really help see drift.
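A minimal sketch of how such a probe-layout mode could assign subplot rows (the channel positions here are hypothetical illustrative data, not phy's API):

```python
import numpy as np

# Hypothetical channel y-positions along the probe, in microns (illustrative data).
channel_y = np.array([20.0, 60.0, 0.0, 40.0])

# Assign each channel a subplot row so that rows follow the physical layout:
# the shallowest channel (largest y) gets row 0, i.e. the top subplot.
rows = np.empty(len(channel_y), dtype=int)
rows[np.argsort(-channel_y)] = np.arange(len(channel_y))

# Plot (rows[c], 0) would then show PC1 vs Time for channel c, so vertically
# adjacent subplots correspond to physically adjacent channels.
```

With this, drift shows up as a coordinated drift of PC1 across neighbouring rows.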

rossant commented 9 years ago

Great stuff in there!

(where depth is a feature computed for each spike - did I request this as a feature? I think it's a really good idea)

what is it exactly? is it the "waveform mean position"?

nsteinme commented 9 years ago

Yes, the y-coordinate of it. Except for each spike individually, rather than for the mean waveform.


nsteinme commented 9 years ago

More generally, if I compute a quantity of my own for each spike, can I plot that quantity as a feature in feature view? This would be a really great thing!


rossant commented 9 years ago

great idea! you can already register your own cluster statistic; we could do the same for spikes, and then add an option to plot it in the feature view

what do you have in mind specifically? would it be ok if the interface was something like:

def myspikequantity(session, cluster):
    # Get the spikes in that cluster
    spike_ids = session.model.spike_train(cluster)
    ...
    return somearraywithonevalueperspike

Basically computing your quantities on a per-cluster basis: this would just make my life simpler because this mechanism already exists...
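As an illustration of what "one value per spike" could look like under that interface, here is a toy quantity computed from spike times alone (the function body is hypothetical; only the one-value-per-spike contract comes from the proposal above):

```python
import numpy as np

def local_firing_rate(spike_times, window=1.0):
    """Toy per-spike quantity: firing rate around each spike, in spikes/s,
    counted in a `window`-second interval centred on the spike."""
    spike_times = np.asarray(spike_times)
    # Count how many spikes fall within +/- window/2 of each spike.
    left = np.searchsorted(spike_times, spike_times - window / 2)
    right = np.searchsorted(spike_times, spike_times + window / 2)
    return (right - left) / window  # one value per spike, as the API expects
```

Inside a `myspikequantity`-style function, this would be applied to the spike times fetched for the cluster and the resulting array returned as-is.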

nsteinme commented 9 years ago

Yep, I think that would do it. I don't think computing on a per-cluster basis is a problem. By the way, would the thing you compute be stored to hard drive so it's available for the next time that you start phy on the same dataset?


rossant commented 9 years ago

By the way, would the thing you compute be stored to hard drive so it's available for the next time that you start phy on the same dataset?

No. The assumption is that custom stats are fast to compute and they are always computed on-the-fly. But that's something we could change if necessary. How intensive are your functions? What is it that you want to compute exactly?

(that's also the case for the default stats, except mean features/masks/waveforms, which take a lot of time to compute due to the high I/O cost of loading everything into RAM...)

nsteinme commented 9 years ago

The main thing is computing the x,y location of each individual spike, which optimally would be based on the waveform amplitude on each channel for each spike. That might be fairly intensive. You might be able to do it on the fly for a selection of waveforms.

Another thing that I would sometimes find useful (and I know at least two other people who think it is a major feature) is computing the value of the waveform at a particular sample. I.e. just to take the value session.store.waveforms(clusterID)[spikeNum, specifiedSampleNum, specifiedChannelNum]. Then plotting that as a feature so you can use the lasso to split on it.
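Both quantities are straightforward to express on an array of extracted waveforms. A sketch, assuming a `(n_spikes, n_samples, n_channels)` waveform array and per-channel y-positions (the array layout and names are assumptions, not phy's actual store format):

```python
import numpy as np

def spike_depths(waveforms, channel_y):
    """Per-spike depth: channel y-positions weighted by each spike's
    peak-to-peak amplitude on every channel."""
    amps = waveforms.max(axis=1) - waveforms.min(axis=1)  # (n_spikes, n_channels)
    return (amps * channel_y).sum(axis=1) / amps.sum(axis=1)

def amplitude_at_sample(waveforms, sample, channel):
    """Waveform value at one (sample, channel) for every spike,
    usable as a feature to lasso-split on."""
    return waveforms[:, sample, channel]
```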

rossant commented 9 years ago

You don't have fast access to the waveforms of all spikes. They are fetched dynamically from the raw data, which is slow for thousands of waveforms. So instead, phy samples ~100 waveforms per cluster and saves them for fast per-cluster read access.

So you can compute quickly anything based on the waveforms as long as only this small subset of waveforms is used. Anything that uses all waveforms will be extremely slow.

nsteinme commented 9 years ago

That is an argument for being able to save the thing you compute so you just have to do it once, yeah? You'd want to compute the depth for all the ones that are plotted in feature view, not the ones that are plotted in waveform view.


nsteinme commented 9 years ago

Or, how about this, let's say I computed the waveform positions myself and saved them to a file. Can I just load that file and use the values as a feature to plot in feature view?


rossant commented 9 years ago

Or, how about this, let's say I computed the waveform positions myself and saved them to a file. Can I just load that file and use the values as a feature to plot in feature view?

that would work, but it's kind of a hack...

Ok, so that suggests a different approach:

cc @nippoo

nippoo commented 9 years ago

OK - so I'm not sure how useful this will be, in reality. The concept of an amplitude discriminator at a fixed time offset seems nice in theory, but isn't this the same goal that the PCs are trying to achieve? If you're finding that the features don't discriminate adequately between waveforms, perhaps try increasing the number of PCs per waveform from 3 to 5 or something? Or perhaps there's even a way of defining the waveform amplitude at a fixed time offset to be a feature and using that at SpikeDetekt-time?

Arbitrary user-defined amplitude splitting is always going to be REALLY slow (aka cups-of-coffee slow) since you're effectively bypassing PCA and going back to the waveforms. We should really avoid any clustering method based on the raw waveforms, since IIRC you should be able to get as much information as you need out of the features. (Correct me if I'm wrong)...

I'm not entirely confident of the maths, but @kdharris101 or @shabnamkadir might have some ideas to suggest about the best way of going about this?

nsteinme commented 9 years ago

The problem arises when some neuron has a really unusual waveform, and this waveform shape is not represented well at all in the PCs. Typically, for instance, this could be some kind of brief up-tick before the main negative deflection. Then the PCs won't have it, but you can see with your eyes that you could easily pick out the spikes from this neuron just by that one sample. What if it were just computed for the currently selected clusters?

And adding more PCs isn't a good solution: it will take a lot longer to do the clustering and there's no guarantee those extra ones will get it, particularly if it is a small number of spikes in question (since they won't contribute much to the total variance).


nippoo commented 9 years ago

We could write a function quite easily to split a cluster in two based on amplitude above/below a certain point, clickable in the WaveformView. But I reckon it'd be something along the lines of:

nippoo commented 9 years ago

So for a cluster with 100,000 spikes it'd take about 1-2 minutes, as a rough guess.

rossant commented 9 years ago

I feel like this is the sort of thing that's best done from IPython rather than hard-coded in the GUI...

nsteinme commented 9 years ago

Well, I'll put it this way. This is a feature that you can do in Plexon Offline Sorter (software I used to use), and I used it there sometimes. When I showed KV to Marius Bauza (O'Keefe lab - you remember him) he was adamant that this feature was critical. Then Dan Denman (Allen Inst) recently told me he thought this was an important feature. So that's why it occurred to me to bring it up. I think those people would be willing to pay 30 seconds for it (it won't be a cluster with 100k spikes, or its waveform shape would have been picked up by the PCA). But Kenneth thinks it's kinda hack-y and a little dubious, which I don't totally disagree with, so I don't have a problem with letting it be a DIY-in-IPython feature, at least for now, pending further user demands. I'll try to send you a screenshot next time I see where this would be useful.


nippoo commented 9 years ago

Same. Unless it's something you'll end up doing regularly...? If this is more than a one-off thing and it'll actually be useful, there should be some way of clicking to set the discriminator's threshold and visually confirm it's about right.

rossant commented 9 years ago

anyway, this is something that might be partially doable with this suggestion -- do you still need it? If so, I'll open a new issue.

nippoo commented 9 years ago

I feel like that might be too general a solution to a specific problem - probably YAGNI for the moment, though I think I've come round to Nick's viewpoint that we probably need something to deal with this particular issue.

rossant commented 9 years ago

well, I think there are many other examples where you'll want to compute custom per-spike data and plot it (like the position of the mouse at the time of every spike, etc.), no?

nsteinme commented 9 years ago

Well, there's a big difference between per-spike data that is computed from the waveforms and per-spike data that is computed from just the spike times (like your position example). So I think you'd handle that differently: you'd already have a vector of x and y positions (small enough for RAM by far), and you'd just index into it.
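That indexing step is cheap. A sketch of the position case, assuming a pre-computed tracking trajectory (names and array shapes are illustrative):

```python
import numpy as np

def position_at_spikes(spike_times, tracking_times, xy):
    """For each spike, return the (x, y) tracking sample taken at or
    just before the spike time."""
    # Index of the last tracking sample not after each spike.
    idx = np.searchsorted(tracking_times, spike_times, side='right') - 1
    idx = np.clip(idx, 0, len(tracking_times) - 1)
    return xy[idx]
```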


rossant commented 9 years ago

@nsteinme I'm gonna do this. What about the following API:

FEATURE VISUAL (low-level, rare user access)

Read-only props:

    view.x_dimensions  # n*n array of strings or tuples (e.g. 'time', (7, 1)) describing the x-axis dimension of every subplot
    view.y_dimensions  # n*n array of strings or tuples

Function to change the dimensions:

    view.set_x_dimensions({(1, 1): 'time', (2, 3): 'myextrafeature'})

FEATURE VIEW MODEL (high-level, user-exposed)

Default matrix:

    time, depth     time,    (x, 0)     time,    (y, 0)     time, (z, 0)
    time, (x', 0)   (x', 0), (x, 0)     (x', 1), (y, 0)     (x', 2), (z, 0)
    time, (y', 0)   (y', 0), (x, 1)     (y', 1), (y, 1)     (y', 2), (z, 1)
    time, (z', 0)   (z', 0), (x, 2)     (z', 1), (y, 2)     (z', 2), (z, 2)

Read-only props:

    vm.x_channels = [x', y', z']
    vm.y_channels = [x, y, z]

API to change the channels:

    vm.set_x_channel(1, y'')  # change the channel in the 3rd row (0-based indexing, and the first row is always time!)
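For concreteness, the default matrix above can be generated programmatically; a sketch (the function and its cell convention are illustrative, not part of the proposed API):

```python
def default_matrix(y_channels, x_channels):
    """Build the default grid: cell (i, j) -> (x_dim, y_dim).
    Row/column 0 show time; inner cells pair the row channel's PCs (x axis)
    with the column channel's PCs (y axis)."""
    n = len(y_channels) + 1
    grid = {}
    for i in range(n):
        for j in range(n):
            if i == 0 and j == 0:
                grid[(i, j)] = ('time', 'depth')
            elif i == 0:
                # First row: time vs PC1 of each column channel.
                grid[(i, j)] = ('time', (y_channels[j - 1], 0))
            elif j == 0:
                # First column: time vs PC1 of each row channel.
                grid[(i, j)] = ('time', (x_channels[i - 1], 0))
            else:
                # Inner cells: PCs vary with position, as in the matrix above.
                grid[(i, j)] = ((x_channels[i - 1], j - 1),
                                (y_channels[j - 1], i - 1))
    return grid
```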
nippoo commented 9 years ago

I'm not 100% sure what this API is supposed to achieve - but if I understand right, you're suggesting we should be able to define an arbitrary feature as the amplitude at a certain point and calculate it for all spikes? If so, I think this isn't really a very good solution to the problem - this will be for one cluster and the time offset will probably be different every time you want to do it, and you'll want to discard this information as soon as the split is complete.

But maybe I'm understanding wrongly?

kdharris101 commented 9 years ago

This is something the very early spike sorting systems had (e.g. Michael Recce’s software).

At that time I was against it as it makes it easy to “carve out” things that look like clear spikes from pure MUA. But I guess it could be useful in a modern context.


nsteinme commented 9 years ago

There are a couple of things this is addressing (correct me if I'm wrong, Cyrille):

1) Allowing the user to choose what's shown in the features grid.

2) Enabling the display of custom features. For example, the main one that the discussion started with is the depth along the probe of each spike, which I believe we all think would be very useful. For another example, it would also allow the kind of amplitude-at-a-single-time-sample feature that became the later part of the discussion.

Looks good to me!


rossant commented 9 years ago

@nsteinme yes you understood correctly; you can define any n_spikes-long array and plot it wherever you want