NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Spiketrain conversion from kilosort outputs - 'spike_times.npy' and 'spike_clusters.npy' #1412

Closed taningh86 closed 7 months ago

taningh86 commented 7 months ago

Hi! I am working with SpikeGLX data and using the 'spike_times.npy' and 'spike_clusters.npy' outputs from Kilosort 2.5 to build a list of spike trains for downstream analysis in the Elephant package. However, I am not getting the right cluster_ids when I analyze them with Elephant, and I think it's because the spike train data structure I am getting is not correct. Below is the code I use to generate the spike train list with Neo. You can see that I check that all unique clusters are being processed, and the output I get shows that all unique clusters are indeed processed.

# Count the total number of unique clusters
total_unique_clusters = len(unique_clusters)

# Print the total number of unique clusters
print("Total number of unique clusters:", total_unique_clusters)

# Inside the loop, add a counter to keep track of how many clusters are processed
processed_clusters = 0

# For each unique cluster (neuron), create a SpikeTrain object for this cluster
for cluster_id in unique_clusters:
    print("Processing cluster_id:", cluster_id)

    # Get the spike times for this cluster
    cluster_spike_times = spike_times[spike_clusters == cluster_id]

    # Create a SpikeTrain object for this cluster with a unique ID (e.g., cluster_id)
    spike_train = SpikeTrain(cluster_spike_times, t_stop=t_stop, units='s', name=str(int(cluster_id)))

    # Add the SpikeTrain to the Segment
    segment.spiketrains.append(spike_train)

    # Increment the counter
    processed_clusters += 1

# Print the total number of clusters processed
print("Total number of clusters processed:", processed_clusters)

And below are the cluster_ids being processed.

Total number of unique clusters: 170
Processing cluster_id: 0
Processing cluster_id: 1
Processing cluster_id: 2
Processing cluster_id: 3
Processing cluster_id: 4
Processing cluster_id: 5
Processing cluster_id: 6
Processing cluster_id: 7
Processing cluster_id: 8
Processing cluster_id: 9
Processing cluster_id: 10
Processing cluster_id: 11
Processing cluster_id: 13
Processing cluster_id: 14
Processing cluster_id: 15
Processing cluster_id: 16
Processing cluster_id: 17
Processing cluster_id: 18
Processing cluster_id: 19
Processing cluster_id: 20
Processing cluster_id: 21
Processing cluster_id: 22
Processing cluster_id: 23
Processing cluster_id: 24
Processing cluster_id: 25
Processing cluster_id: 26
Processing cluster_id: 27
Processing cluster_id: 28
Processing cluster_id: 29
Processing cluster_id: 30
Processing cluster_id: 31
Processing cluster_id: 32
Processing cluster_id: 33
Processing cluster_id: 34
Processing cluster_id: 35
Processing cluster_id: 36......
Processing cluster_id: 170
Total number of clusters processed: 170

And below is the output I get for print(segment.spiketrains)

[<SpikeTrain(array([2.15332585e-02, 4.75665014e-02, 6.51997735e-02, ...,
       1.81889782e+03, 1.81893855e+03, 1.81897822e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([1.14499602e-01, 7.44730747e-01, 1.01186315e+00, ...,
       1.81442366e+03, 1.81443600e+03, 1.81748755e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([6.06997892e-02, 1.73532731e-01, 1.92465998e-01, ...,
       1.81876368e+03, 1.81886392e+03, 1.81891558e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([3.16665567e-03, 2.57332440e-02, 4.38331811e-02, ...,
       1.81894222e+03, 1.81896152e+03, 1.81898535e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([2.67999069e-02, 3.30332186e-02, 1.01166315e-01, ...,
       1.81897472e+03, 1.81899372e+03, 1.81899885e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([1.37799521e-01, 2.24299221e-01, 5.32064819e-01, ...,
       1.81535033e+03, 1.81548169e+03, 1.81878032e+03]) * s, [0.0 s, 1818.998848575428 s])>, <SpikeTrain(array([6.49331078e-02, 1.34532866e-01, 1.51699473e-01, ...,
       1.81361510e+03, 1.81415243e+03, 1.81580836e+03]) * s..

It's a list of spike trains. However, the issue I am having is that the cluster_id is not correctly identified. For cluster_ids 0-170, the Neuron ID in the plot below ranges from 0-1700. I cannot figure out why that is. Is it a problem with how ASSET from Elephant identifies clusters, or is it an issue with the structure of the spike trains generated by Neo? I will very much appreciate any help with this.

ASSET intersection matrix
zm711 commented 7 months ago

If you type len(segment.spiketrains), what do you get?

Also, Kilosort returns spike times in frames/samples, not time. Did you convert? We are missing a couple of things in your script to know what is really going on.
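For reference, here is a minimal sketch of the samples-to-seconds conversion. The sampling rate below is a made-up example value; the real one lives in the SpikeGLX/Kilosort metadata.

```python
import numpy as np

# Hypothetical sampling rate for illustration (Neuropixels acquisition is
# often 30 kHz; read the actual value from your recording's metadata)
sampling_rate_hz = 30000.0

# spike_times.npy from Kilosort holds sample indices, not seconds
spike_times_samples = np.array([30000, 60000, 90000])

# Divide by the sampling rate to convert sample indices to seconds
spike_times_sec = spike_times_samples / sampling_rate_hz
print(spike_times_sec)  # [1. 2. 3.]
```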

Also, I edited your post to make it easier to read. Triple backticks are way better for code so that we can see what's happening :)

taningh86 commented 7 months ago

Hi Zm711! Sorry about the bold text in the code. I did not realize it would come out that way with the '<>' option on the taskbar. Thanks for correcting it. I can imagine how unpleasant it would be to read with those giant bold letters. Anyway, for len(segment.spiketrains) I get 170, which is the number of unique clusters. And sorry, I forgot to mention that the spike_times output from Kilosort is already converted to seconds (see the file below). Here's the complete code I use to generate the spike train list. There are some unnecessary lines of code to check whether all unique clusters are actually processed and what the total number of processed clusters is.

import numpy as np
from neo import Segment, SpikeTrain

# Load spike times and spike clusters
spike_times = np.load('G:/Cat_GT_Out/catgt_12_29_23_redo_ACA_LHA_24FAST_EXP_g0/results_KS25/sorter_output/spike_times_sec.npy')
spike_clusters = np.load('G:/Cat_GT_Out/catgt_12_29_23_redo_ACA_LHA_24FAST_EXP_g0/results_KS25/sorter_output/spike_clusters.npy')

# Assuming t_stop is the total duration of your recording
t_stop = np.max(spike_times)

# Get the unique cluster IDs (i.e., the neuron IDs)
unique_clusters = np.unique(spike_clusters)

# Create a Segment to hold the spike trains for each neuron
segment = Segment()
# Count the total number of unique clusters
total_unique_clusters = len(unique_clusters)

# Print the total number of unique clusters
print("Total number of unique clusters:", total_unique_clusters)

# Inside the loop, add a counter to keep track of how many clusters are processed
processed_clusters = 0

# For each unique cluster (neuron), create a SpikeTrain object for this cluster
for cluster_id in unique_clusters:
    print("Processing cluster_id:", cluster_id)

    # Get the spike times for this cluster
    cluster_spike_times = spike_times[spike_clusters == cluster_id]

    # Create a SpikeTrain object for this cluster with a unique ID (e.g., cluster_id)
    spike_train = SpikeTrain(cluster_spike_times, t_stop=t_stop, units='s', name=str(int(cluster_id)))

    # Add the SpikeTrain to the Segment
    segment.spiketrains.append(spike_train)

    # Increment the counter
    processed_clusters += 1

# Print the total number of clusters processed
print("Total number of clusters processed:", processed_clusters)

Also, I found out that the Neuron ID range in the plot above is a function of bin_size in the ASSET analysis. If I put bin_size = 10 * pq.s instead of bin_size = 1000 * pq.ms, then I get the right range of clusters, which is 170. Below is the code for ASSET.

import quantities as pq
import matplotlib.pyplot as plt
from elephant.asset import ASSET  # import path may differ between Elephant versions

asset_obj = ASSET(segment.spiketrains, bin_size=10 * pq.s)

# Compute the intersection matrix
imat = asset_obj.intersection_matrix()

# Plot the intersection matrix
plt.matshow(imat)
plt.colorbar()
plt.xlabel('Neuron ID')
plt.ylabel('Neuron ID')
plt.title('Intersection Matrix')
plt.show()
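As a rough sanity check (I am only guessing here that the matrix axes count time bins rather than neurons), the axis ranges I see line up with the recording duration divided by the bin size:

```python
import numpy as np

# Back-of-the-envelope arithmetic, not ASSET internals:
t_stop = 1818.998848575428  # recording duration in seconds, from t_stop above

bins_at_1s = int(np.ceil(t_stop / 1.0))    # bin_size = 1000 ms -> ~1819 bins
bins_at_10s = int(np.ceil(t_stop / 10.0))  # bin_size = 10 s    -> ~182 bins

print(bins_at_1s, bins_at_10s)  # 1819 182
```

That would explain why the axis range tracks bin_size rather than the number of spike trains.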

I have raised this question on Elephant's GitHub page. But if you have any insight on this, please share.

ASSET intersection matrix 1
zm711 commented 7 months ago

@taningh86,

Based on what I see it looks like you are making your SpikeTrain objects fine. I'm not sure how ASSET works, so I think that part of the question really does belong with Elephant. Hopefully they can explain the dependence on bin size. I did a little mock test from my computer with simulated SpikeTrains and saw something similar where when I had 12 SpikeTrains the Neuron ID varied from 5-12 depending on bin size. So definitely ask them.

I went to their issue tracker and saw the suggestion for a SpikeGLX extractor. In this case that is probably not needed; you probably already used the extractor (based on your folder format I'm guessing you use SpikeInterface). What you could do is use spikeinterface.extractors.read_kilosort to read in the sorting results, then do sorting.to_spike_vector() to get essentially the same data as reading in the .npy files as you do above. It is just personal preference, I guess.

taningh86 commented 7 months ago

Hi Zach, thanks a lot for giving it a look. At this point I have also come to the conclusion that it is something to do with how ASSET reads bin_size. I tried running it without the bin_size and it ran into a memory error. I have reached out to the people at Elephant already and hope they will get back to me soon. Yeah, I used SpikeInterface and ran Kilosort in a containerized version and got all the Kilosort outputs. The info you gave about sorting.to_spike_vector() is very helpful. I was wondering if that would give the spike train list the way Neo does. I will give it a try tomorrow and see how things work. Will update you. Thanks again for your help.

zm711 commented 7 months ago

It returns a numpy array of all your spike_times and unit_indices (not really cluster ids, but an index to easily go back to spikeinterface).

So something like:

spike_vector = sorting.to_spike_vector(concatenate=True) # this will mimic how kilosort data is stored in .npy files
spike_times_samples = spike_vector["sample_index"]
spike_cluster_indices = spike_vector["unit_index"]

From there you would repeat your code (so convert spike_times_samples to seconds) and iterate through the unique indices. The one thing is that in order to know the cluster_id from Kilosort you have to do this:

sorting.unit_ids[0] # this will return the cluster_id of the 0 index from the spike vector

But to use elephant you would still need to use your code above to put into SpikeTrains.
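If it helps, here is a minimal sketch of that index-to-cluster_id mapping, with mock numpy arrays standing in for the real to_spike_vector() output and unit_ids (all the numbers are made up for illustration). Each resulting array would then be wrapped in a neo.SpikeTrain exactly as in your loop above:

```python
import numpy as np

# Mock data standing in for sorting.to_spike_vector() output (illustration only)
spike_times_sec = np.array([0.1, 0.2, 0.35, 0.5, 0.9])
spike_cluster_indices = np.array([0, 1, 0, 1, 0])
unit_ids = np.array([3, 7])  # stands in for sorting.unit_ids

# Group spike times by vector index, keyed by the Kilosort cluster_id
trains = {}
for idx in np.unique(spike_cluster_indices):
    cluster_id = unit_ids[idx]  # map the vector's unit_index back to cluster_id
    trains[cluster_id] = spike_times_sec[spike_cluster_indices == idx]

print(sorted(trains.keys()))  # [3, 7]
```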

taningh86 commented 7 months ago

Thanks Zach. I am running it now and will update you on how things work out. I already have a workflow to convert spike times from samples to seconds. Looking forward to finding out how to_spike_vector differs from the Kilosort output of spike_times.npy.

samuelgarcia commented 7 months ago

Hi, we have a from_neo_spiketrain_list() here: https://github.com/SpikeInterface/spikeinterface/blob/main/src/spikeinterface/core/numpyextractors.py#L397. Maybe we should implement a to_neo_spiketrain_list in spikeinterface.

taningh86 commented 7 months ago

Hi Samuel! to_neo_spiketrain_list would be a good option to have.

zm711 commented 7 months ago

@taningh86, not to bounce you around, but could you post that request in the spikeinterface issue tracker and link this issue? That way we can keep track of things. From the Neo side we can't put it on the todo list, but if you post the request in spikeinterface we can put it on our todo list there.

zm711 commented 7 months ago

I'm going to close this for now since we've solved the spiketrain issue. But definitely open something on spikeinterface if you're interested @taningh86.

taningh86 commented 7 months ago

Hi Zach. Sorry, I have not done it yet but will certainly do it this weekend. Thanks again!

taningh86 commented 7 months ago

Hi Zach. It's up on the spikeinterface issues page.