cortex-lab / phy

phy: interactive visualization and manual spike sorting of large-scale ephys data
BSD 3-Clause "New" or "Revised" License

weird CCG bug #705

Closed nsteinme closed 7 years ago

nsteinme commented 8 years ago

I'm seeing a weird bug where the CCG changes depending on how many units I have selected, or more specifically on how many total spikes are in the selected units. Basically, if I look at the shape of the CCG for a pair of units and then start selecting more and more units to see more CCG pairs (keeping the original two selected), the shape of the original pair's CCG changes, in such a way that it appears to oscillate at around 2.5 Hz.

Here's an example. First, a pair: [screenshot: CCG for a pair of units]

Then selecting one more: [screenshot]

So far so good; now selecting a few more: [screenshot]

Note that the original pair (blue and red) has now taken on the same weird shape. These plots use bin size = 1 ms, window = 10000 ms. The effect comes on gradually as I select more and more units, not all at once.

The effect depends on the total number of spikes selected: I can select a dozen clusters without seeing it if they all have small spike counts (<5000), but selecting just a few clusters with huge counts (~100,000) triggers it. You don't see it much at small window sizes (like 100 ms), simply because it's a slow oscillation, but it definitely doesn't depend on the window size. Because it depends on the number of spikes selected, and not on the window size or on the size of the CorrelogramView window (meaning the size of the box on the screen), I don't think it's any kind of graphical aliasing.

Let me know if further details would be helpful.

rossant commented 8 years ago

This probably comes from the fact that, for performance reasons, the CCGs are computed using N spikes across all selected clusters. N is fixed and doesn't depend on the number of selected clusters, so when the number of selected clusters increases, the number of spikes per cluster decreases.

The fix would be to use N*n_clusters spikes, at the expense of performance when selecting multiple clusters. Please let me know the smallest N possible that would give sensible results (currently it is 100,000).
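For concreteness, a toy illustration of the budget arithmetic (assuming, for simplicity, equally sized clusters; illustrative numbers only, not phy's actual code):

```python
# Old behaviour: a fixed total budget N is shared across all selected
# clusters, so each cluster's share shrinks as more clusters are selected.
N = 100_000
for n_clusters in (2, 5, 10):
    print(n_clusters, "clusters ->", N // n_clusters, "spikes per cluster")
# 2 clusters -> 50000 spikes per cluster
# 5 clusters -> 20000 spikes per cluster
# 10 clusters -> 10000 spikes per cluster
# With the proposed fix (N * n_clusters spikes in total), each cluster
# keeps N spikes no matter how many clusters are selected.
```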

rossant commented 8 years ago

This should now be fixed in optim.

nsteinme commented 8 years ago

Thanks, I will test it. I'm sure that will resolve it for me, since in my particular dataset the biggest clusters have only about 120,000 spikes. But I think it must not be about the number of random spikes that are chosen, but rather about the way in which they are chosen? I guess you must be choosing them in blocks, is that right? Since it is just selecting items from a single vector, would it be so much slower to pick them truly randomly? I think that would resolve it for clusters of any size, rather than just pushing the problem off until someone comes along with a dataset 10x as long as mine.


rossant commented 8 years ago

I'm not sure I'm following you. The problem you have is, I think, related to the fact that the number of spikes used for computing the CCG of any given cluster depends on the total number of selected clusters. With the latest change in the code, the number of spikes used in each cluster no longer depends on the number of selected clusters, so you shouldn't see this problem, even with huge clusters containing millions of spikes.

As for choosing spikes randomly, I fear that you might miss fast correlations, because there would be a low probability of two close spikes both being selected for the computation of the CCGs. This is the reason why I compute the CCGs by selecting B contiguous blocks of S spikes (considering the sorted array with all spike times across all clusters).
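A toy sketch of that block scheme (names and details are mine, not phy's actual implementation): pick B contiguous runs of S spikes from the merged, time-sorted spike array, so that nearby spikes are kept or dropped together.

```python
import numpy as np

def select_blocks(spike_times, n_blocks, block_size, seed=0):
    """Keep n_blocks contiguous runs of block_size spikes each, chosen at
    random from the time-sorted spike_times array. Blocks may overlap in
    this toy version; phy's real scheme may differ."""
    rng = np.random.default_rng(seed)
    n = len(spike_times)
    block_size = min(block_size, n)
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    idx = np.unique(np.concatenate(
        [np.arange(s, s + block_size) for s in starts]))
    return spike_times[idx]
```

Because whole blocks survive or vanish together, a spike pair that straddles two blocks only survives when both blocks happen to be kept, which is where the lag-dependent bias discussed next comes from.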

nsteinme commented 8 years ago

I'm not sure about your logic with the fast correlations... I think all the CCG bins will shrink equally toward zero as you drop more and more spikes, not that short intervals will be particularly affected. Think of it this way: if, using all spikes from clusters A and B, there were 10 coincidences with lag = -100 ms and 10 coincidences with lag = -2 ms, then there are exactly 10 spikes you could drop from cluster A that would reduce the -100 ms bin, and likewise exactly 10 spikes you could drop that would reduce the -2 ms bin. Those 10 spikes are not special in either case; they will be dropped with equal probability, so the whole CCG shrinks equally. By selecting contiguous blocks, instead, you introduce spurious structure: if the block size is, say, 1 s, and your original ACG had 10 spikes at lag = -1100 ms, the probability of those specific spikes getting dropped approaches 1, since two spikes separated by more than the block length can never fall within the same block. If the blocks are the same across all clusters (I infer that they are), then the same argument applies to the CCGs.
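A quick simulation of this argument (a sketch under simplified assumptions: pairs placed uniformly, blocks kept independently; hypothetical code, not phy's): under independent random keeping with probability p, a pair survives with probability p² at every lag, whereas block keeping favours lags shorter than the block length.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5            # fraction of spikes (or blocks) kept
block = 1.0        # block length in seconds
n = 200_000        # simulated spike pairs per lag

for lag in (0.002, 0.100, 1.100):   # lags in seconds
    # Independent random keeping: both spikes of a pair must survive,
    # which happens with probability p**2 regardless of lag.
    p_random = np.mean((rng.random((n, 2)) < p).all(axis=1))

    # Block keeping: a spike at time t lives in block floor(t / block);
    # the pair survives only if both spikes' blocks are kept.
    t0 = rng.random(n) * block                 # first spike of the pair
    same = np.floor(t0 / block) == np.floor((t0 + lag) / block)
    kept0 = rng.random(n) < p                  # first spike's block kept?
    kept1 = np.where(same, kept0, rng.random(n) < p)
    p_block = np.mean(kept0 & kept1)

    print(f"lag={lag*1e3:6.0f} ms  random={p_random:.3f}  block={p_block:.3f}")
```

With p = 0.5, random keeping retains about 25% of pairs at every lag, whereas block keeping retains about 50% at 2 ms but only about 25% at 1100 ms, so under these assumptions the fast bins end up inflated by up to a factor of 1/p relative to the slow ones.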

I vote for choosing at least N per cluster (as you have implemented) and having them be randomly selected, not in blocks. I could be wrong about this! If you think so, we should ask Kenneth about it; I think he'll know the correct answer.


rossant commented 8 years ago

I see how selecting blocks would favor fast correlations to the disadvantage of slow ones. But wouldn't selecting spikes at random in the spike_times array (all spikes across all clusters, sorted) induce a bias with clusters that have very different sizes?

nsteinme commented 8 years ago

I'm not sure; I don't think so. But in any case, why not select the N=100,000 spikes randomly for each cluster, independently?
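A sketch of that suggestion (hypothetical helper, not phy's API): subsample each cluster independently and uniformly at random, then merge back in time order.

```python
import numpy as np

def random_subset_per_cluster(spike_times, spike_clusters,
                              n_max=100_000, seed=0):
    """Keep at most n_max uniformly random spikes from each cluster."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(spike_clusters):
        ids = np.nonzero(spike_clusters == c)[0]
        if len(ids) > n_max:
            ids = rng.choice(ids, size=n_max, replace=False)
        keep.append(ids)
    keep = np.sort(np.concatenate(keep))  # restore time order
    return spike_times[keep], spike_clusters[keep]
```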


rossant commented 8 years ago

It's a bit more expensive as it cannot be vectorised easily, but I'll have a go.