Label mismatch in WaveformView and FeatureView vs. Correlogram View and n_spikes

omeleavitt commented 3 months ago

Currently running Windows 11 and data sorted from Kilosort 3. Whenever I perform a split, the Feature View and Waveform View show features that are internally consistent (in my screenshot below, the red unit in the Waveform View is the noise pulled out by splitting in FeatureView). However, the n_spikes and correlogram are reversed, making it impossible to tell which is the "real" spike. Incidentally the amplitude view is not showing data. This is not associated with any errors in the command prompt as far as I can tell. Any help would be appreciated.

zm711 commented 2 months ago

It's best to run phy template-gui params.py --debug and post the output if a view is failing.

Could you explain your other problem a bit more? What do you mean by being unable to tell which is the "real" one? And by spike, do you mean unit/cluster?

gminami commented 1 month ago

Hi @zm711, this is a known issue with Kilosort 3. I have seen many reports of it elsewhere; for instance, #1257 refers to the same problem. Please look at the waveforms in @omeleavitt's screenshot. In WaveformView, the blue cluster has decent spikes, whereas the red cluster seems to be just noise. On the other hand, if you compare the spike counts in ClusterView, the red cluster has the larger number of spikes, suggesting that it is the 'core' of the original cluster (i.e. spikes inside the convex hull) and that the remaining blue cluster is noise (outside the convex hull). The same applies to CorrelogramView, ISIView and the other views. To sum up, every view except WaveformView indicates that blue is noise (outliers) and red is the within-convex-hull spike cluster. Do you have any solution for the mismatch between WaveformView and all the other views?

zm711 commented 1 month ago

I think it would be easier if I could open this myself and poke around. Does anyone have a small dataset that reproduces this problem (maybe <2GB?)?

We would need to know if this is a phy/phylib or a ks issue. And having data that produces this would be easiest!

gminami commented 1 month ago

Hi @zm711, thank you for your prompt reply and your interest in this issue! Please look into and try the dataset. Let me know if you have any trouble browsing/downloading. I prepared a real but modified data file so that the total size would be a little less than 2GB. I could upload a smaller dataset, but the caveat is that the above-mentioned issue is somehow dependent on data size (probably on spike number). I ran "GM-tryKS3-F27\rec1\run_KS3.m" and moved all the outputs to GM-tryKS3-F27\temp_kilo3. If you run Phy, circle any cluster with a large number of spikes (it doesn't matter whether it is mua or good) so that most, but not all, of its spikes are included, and then split, you should be able to reproduce what I and others are talking about.

zm711 commented 1 month ago

Thanks @gminami. I'm booked up for the next couple days, so my hope is to check this out in more depth on Friday!

gminami commented 1 month ago

@zm711 looking forward to your update!! FYI, the following is what happens with the dataset I sent. If you circle all the points but one in FeatureView and split (1st screenshot), you'll get a number of blue waveforms that are similar to the template and one somewhat deviated red waveform, indicating that blue was the set originally inside the circle and red was outside. On the other hand, ClusterView reports the majority of spikes in the red cluster, not blue. Interestingly, this mismatch doesn't occur when you circle the minority or the outliers (in this case, circling the one at the bottom left).

zm711 commented 3 weeks ago

@gminami,

I have a hypothesis! Some of the views only load a random sample of the spikes so that they are not too resource-intensive. For example, if you look at your waveforms in the WaveformView, there are not 30,000 traces there, right? So the FeatureView is only showing the features of the same subset of spikes. If you instead curate based on the AmplitudeView (of 38 to be similar to your image) we see that the numbers make a lot more sense. But in the feature view we still see very few red dots.

This is a limitation of these types of GUIs. It would be way too slow to display all the data. This is true in SpikeInterface and the SpikeInterface GUI as well. There we let you choose the spikes you want (although not as freely as we want yet). But Phy (at the GUI level) chooses for you.

This is why splitting is so difficult compared with merging at the GUI level. The FeatureView can't possibly show you all the information without crashing the app. So you make splits based on imperfect info. Merges are easier to attempt.
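
The hypothesis above can be illustrated with a toy model. This is a minimal sketch in plain Python, not phy's actual code: the function name split_by_lasso, the 30,000-spike cluster, and the 500-spike subset are all assumptions chosen for illustration. It assumes a split can only move spikes that were drawn in the view, so every spike outside the subset stays behind in the other cluster.

```python
import random


def split_by_lasso(all_spike_ids, subset_ids, lassoed_ids):
    """Toy model of a FeatureView split (hypothetical, not phy's implementation).

    Only spikes in the extracted subset are drawn, so only those can be lassoed.
    Returns (new_cluster, remainder) as sorted lists of spike ids.
    """
    subset = set(subset_ids)
    # A lasso can only enclose points that were actually drawn.
    selected = set(lassoed_ids) & subset
    new_cluster = sorted(selected)
    remainder = sorted(set(all_spike_ids) - selected)
    return new_cluster, remainder


rng = random.Random(0)
all_spikes = list(range(30_000))      # every spike in the cluster
subset = rng.sample(all_spikes, 500)  # spikes assumed to have extracted features
lassoed = subset[:-1]                 # circle all drawn points but one

new_cluster, remainder = split_by_lasso(all_spikes, subset, lassoed)
print(len(new_cluster), len(remainder))  # 499 29501
```

Under this assumption, the lassoed "core" ends up as the small cluster in ClusterView, while the cluster holding the single left-out point also inherits every spike that was never drawn. That would be consistent with the spike counts reported in this thread.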

Does all this make sense?

gminami commented 3 weeks ago

@zm711

> So some of the views only load a random sampling of the spikes in order to not be too intensive.

Yes, I think this feature of Phy is reasonable. Such sparse representation per se hasn't been a problem as a 'cloud' of points lets you estimate the distribution of a cluster. I've been able to split clusters just fine in KS1 up to KS2.5.

The issue is not sparse representation of data, but FALSE representation; splitting of a large cluster in Phy2 after KS3 displays wrong data -- data that does not belong to the current cluster.

In the example I sent, either the waveforms or points on the feature view are not real (because there is a mismatch), I suspect.

Since KS3 does not create npy files for Phy that's needed for FeatureView and others, I needed to run phy extract-waveforms params.py in my python environment to create those (according to what some KS3 users claim to do). This may be the cause of malfunction.
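
To see which of these files a given sorting directory actually contains, a small check along the following lines may help. It is only a sketch: the file names are assumptions based on phy's template-GUI conventions (pc_features.npy and pc_feature_ind.npy for precomputed features, _phy_spikes_subset.*.npy for the output of phy extract-waveforms) and should be verified against the installed phy version. The helper name report_phy_files is hypothetical.

```python
from pathlib import Path

# File names assumed from phy's template-GUI conventions; verify for your phy version.
FEATURE_FILES = ("pc_features.npy", "pc_feature_ind.npy")
SUBSET_FILES = (
    "_phy_spikes_subset.spikes.npy",
    "_phy_spikes_subset.channels.npy",
    "_phy_spikes_subset.waveforms.npy",
)


def report_phy_files(sort_dir):
    """Return {file name: True/False} for the feature and waveform-subset files."""
    sort_dir = Path(sort_dir)
    return {name: (sort_dir / name).is_file() for name in FEATURE_FILES + SUBSET_FILES}


if __name__ == "__main__":
    # "." is a placeholder; point it at the directory that contains params.py.
    for name, present in report_phy_files(".").items():
        print(f"{'found  ' if present else 'missing'}  {name}")
```

If the feature files are missing and only the subset files are present, the views can presumably draw only the extracted subset of spikes.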

Is there a way to fix this? For instance, if you come up with a way to create feature-related npy files for Phy on MATLAB (as in all other versions of kilosort), that would be great. I'm afraid Phy2 has less than half the value without splitting function in FeatureView.

zm711 commented 3 weeks ago

Let me answer in reverse because the last point is easiest.

> For instance, if you come up with a way to create feature-related npy files for Phy on MATLAB (as in all other versions of kilosort), that would be great. I'm afraid Phy2 has less than half the value without splitting function in FeatureView

That needs to be done by the Kilosort team. They had plans to do it but then dropped it when they started working on KS4. So you’d have to do that yourself. Sorry.

> Since KS3 does not create npy files for Phy that's needed for FeatureView and others, I needed to run phy extract-waveforms params.py in my python environment to create those (according to what some KS3 users claim to do). This may be the cause of malfunction.

Yes, people do this to get features. But with the extraction process you lock in your random spikes, so the views can’t update in the future. I think you could argue that the distribution chosen for some clusters is poor, but once you’ve extracted that distribution you’re locked in.
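
One way to judge how well the locked-in subset covers each cluster is to count, per cluster, how many spikes were extracted versus how many exist. The sketch below is plain Python on toy data, and the function name subset_coverage is hypothetical. In practice the two inputs would presumably come from spike_clusters.npy and _phy_spikes_subset.spikes.npy (assumed file names) loaded with numpy.load.

```python
from collections import Counter


def subset_coverage(spike_clusters, subset_spike_ids):
    """Per cluster, count how many spikes are in the extracted subset vs. in total.

    spike_clusters: sequence where spike_clusters[i] is the cluster of spike i.
    subset_spike_ids: indices of the spikes whose waveforms were extracted.
    Returns {cluster_id: (n_in_subset, n_total)}.
    """
    totals = Counter(spike_clusters)
    in_subset = Counter(spike_clusters[i] for i in subset_spike_ids)
    return {c: (in_subset.get(c, 0), n) for c, n in sorted(totals.items())}


# Hypothetical toy data: cluster 7 has 6 spikes, cluster 9 has 2.
spike_clusters = [7, 7, 9, 7, 7, 9, 7, 7]
subset_spike_ids = [0, 3, 5]
print(subset_coverage(spike_clusters, subset_spike_ids))
# {7: (2, 6), 9: (1, 2)}
```

A cluster whose subset count is tiny relative to its total is one where a FeatureView split is made on very little of the data.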

> The issue is not sparse representation of data, but FALSE representation; splitting of a large cluster in Phy2 after KS3 displays wrong data -- data that does not belong to the current cluster.

I don’t understand this part. I don’t see anything that struck me as false. It struck me more that different views have different amounts of data available, so they update what they can. That doesn’t make anything false; again, it’s a limitation of these GUI representations. But if you explain this more, maybe I can be convinced.

But I want to emphasize that overall you’re right: a big difference between KS <3 and KS3 is that KS didn’t finish the npy writing, which means that the FeatureView can only have the info you extract rather than updated views based on the prewritten full files. So again, not false, just limited compared to what was possible before.

You could try asking the KS team, but they said before that they wouldn’t work on it, so I don’t think they will.

cortex-lab / phy

Label mismatch in WaveformView and FeatureView vs. Correlogram View and n_spikes #1261