cortex-lab / KiloSort

GPU code for spike sorting
GNU General Public License v2.0
178 stars 100 forks source link

Waveforms #35

Closed micheleacox closed 5 years ago

micheleacox commented 8 years ago

I have a few questions about plotting waveforms in MATLAB after the data is sorted. I am pasting Figure 2 from Pachitariu M. et al., 2016, for reference.

(1) Does KiloSort output waveforms that correspond to each spike? If so, which .npy file and/or rez field contains these? If not, is it best to use the timestamps and retrieve the data directly from the source data file? (2) Can the PC and template outputs of KiloSort be used to create the red traces in the figure below? If so, how?

untitled Pachitariu M, Steinmetz NA, Kadir S, Carandini M and Harris KD (2016). Kilosort: realtime spike-sorting for extracellular electrophysiology with hundreds of channels. bioRxiv dx.doi.org/10.1101/061481

nsteinme commented 8 years ago

Hi Michele,

Yes you'll have to read the waveforms out of the raw file directly using the spike times. Here's a code snippet in matlab that should work:

mmf = memmapfile(params.filename, 'Format', {params.dataType, [nChInFile nSamp], 'x'});

st = readNPY('spike_times.npy'); % these are in samples, not seconds clu = readNPY('spike_clusters.npy'); theseST = st(clu==19); % spike times for cluster 19 extractST = theseST(1:min(100,length(theseST))); %extract at most the first 100 spikes nWFsToLoad = length(extractST); nCh = 64; % number of channels wfWin = [-30:30]; % samples around the spike times to load nWFsamps = length(wfWin); theseWF = zeros(nWFsToLoad, nCh, nWFsamps); for i=1:nWFsToLoad tempWF = mmf.Data.x(1:nChInFile,extractST(i)+wfWin(1):extractST(i)+wfWin(end)); theseWF(i,:,:) = tempWF(params.chanMap+1,:); end

For #2, the red waveforms should be the contents of templates.npy. See here for some more documentation about the npy files produced: https://github.com/kwikteam/phy-contrib/blob/master/docs/template-gui.md

On Thu, Oct 27, 2016 at 8:22 PM, Michele A Cox notifications@github.com wrote:

I have a few questions about plotting waveforms in MATLAB after the data is sorted. I am pasting Figure 2 from Pachitariu M. et al., 2016, for reference.

(1) Does KiloSort output waveforms that correspond to each spike? If so, which .npy file and/or rez field contains these? If not, is it best to use the timestamps and retrieve the data directly from the source data file? (2) Can the PC and template outputs of KiloSort be used to create the red traces in the figure below? If so, how?

[image: untitled] https://cloud.githubusercontent.com/assets/7706858/19781667/eb17c9d2-9c4f-11e6-9b33-2a1e1190dcd0.png Pachitariu M, Steinmetz NA, Kadir S, Carandini M and Harris KD (2016). Kilosort: realtime spike-sorting for extracellular electrophysiology with hundreds of channels. bioRxiv dx.doi.org/10.1101/061481

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cortex-lab/KiloSort/issues/35, or mute the thread https://github.com/notifications/unsubscribe-auth/AHPUP--pNMWrCdcZ9xotkVTEFQZkt-UQks5q4PnbgaJpZM4Kiufq .

linussun commented 7 years ago

Thanks Nicholas & Kilosort team for this wonderful tool! I am trying to reconstruct waveforms as Michelea is doing above on real world 32 channel array data & the eMouse example in Kilosort. I've been through the documentation and the code you provided above. I have a few questions about this:

  1. template_features_ind.phy in my Kilosort files is a 16x64 matrix. If I understand correctly each column of 16 is associated with each template that is used to extract a spike. In the documentation, it states that these 16 values represent "other Features" and in my data set, the values range from 1-64. Where are these 'other features' defined/specified and what are they exactly?

  2. The templates.npy data are a generic template waveform that captures a spike and it has very low values. The real world waveform raw data from the mmf variable has a much larger values. Is there a matrix/npy file that stores the relative gain to match the template with the actual spike like it is matched in the phy GUI? In addition when using phy to review the Kilosort data, clusters with large waveforms have a single highlighted blue waveform coming from one channel while other clusters (esp those with low amplitude wave forms across many channels) appear to have more than one channel highlighted in blue, what does this represent, and is this stored in any of the Kilosort files?

  3. What is the best way to recreate Figure 2, where red traces overlap the raw waveform? What is the best way to recreate the average waveform as seen in the waveform view in Phy?

Thanks. LDS

marius10p commented 7 years ago

Hi,

  1. The other features are the "template features", described in figure 4 of the paper. These are simply projections of each spike onto all templates. The feature values are actually in template_features.phy, which in your case would be 16 by the number of spikes. The file you pointed out contains indexing information, which is used to sparsify the encoding of these features: for a spike assigned to template N, we only want to compute the projections onto the most similar 16 other templates. For large datasets, this sparsification is crucial for saving disk space.

  2. templates.npy have unit norm across channels and timepoints. The amplitudes.npy scale these templates. Further rescaling is necessary to unwhiten the templates, by multiplying with the inverse of whitening matrix, available in whitening_mat_inv.npy. Both of these operations have already been applied in the matlab results variable: look for rez.Wraw.

The highlights are a different issue and are computed in Phy. I believe for each template, only the channels with a large enough magnitude (relative to the max channel) are colored in blue.

  1. The waveform view in Phy displays the templates, not the average waveform. templates.npy also contains the rank-3 templates, not the average waveform. If you want the true average waveforms, after processing (and while variables are still in your workspace), you could run the function gather_mean_spikes. We should add more documentation to this function, but it should just work.
micheleacox commented 7 years ago

While trying to implement the solution that @nsteinme suggested (thank you!), I am realizing that I don't totally understand how to map cluster ID to channel number within MATLAB (i.e., without having to open PHY). Basically, I'm after the info in the ClusterView potion of the PHY GUI.

nsteinme commented 7 years ago

I think phy gets this as the channel with the maximum amplitude waveform. So just take that cluster's template, which is size nChannels x nSamples, take the max of each channel minus the min (for example) then find the channel with the largest.

On Tue, Dec 6, 2016 at 11:36 PM, Michele A Cox notifications@github.com wrote:

While trying to implement the solution that @nsteinme https://github.com/nsteinme suggested (thank you!), I am realizing that I don't totally understand how to map cluster ID to channel number within MATLAB (i.e., without having to open PHY). Basically, I'm after the info in the ClusterView potion of the PHY GUI.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cortex-lab/KiloSort/issues/35#issuecomment-265308431, or mute the thread https://github.com/notifications/unsubscribe-auth/AHPUP0WMRqsVmsCbcVT7_K8FBIrE4vJqks5rFfGKgaJpZM4Kiufq .

micheleacox commented 7 years ago

In gather_mean_spikes, I get an error with "stimes" (lines 33-37):

clear stimes
% for iNN = 1:size(rez.W,2)
%     stimes{iNN} = rez.st3pos(rez.st3pos(:,2)==iNN,1);
% end
stimes = gtimes;

What is "stimes" suppose to be? I don't have "gtimes" or "rez.st3pos" defined after I run the three main functions (preprocessData, fitTemplates, fullMPMU).

marius10p commented 7 years ago

Not sure what gtimes was supposed to be, but I think stimes has to be the commented text. I have uncommented that, let me know if it goes through (cannot check it right now, sorry).

micheleacox commented 7 years ago

I tried the commented, but my rez structure is missing the required field, "st3pos".

micheleacox commented 7 years ago

Oh, is it possible the spike times?

marius10p commented 7 years ago

I might have changed it to st3, but it's the same field, and yes, it just includes the spike times.

micheleacox commented 7 years ago

Thanks for your help so far @marius10p and @nsteinme.

I ended up with the following function (KiloSort2SpikeStruct.m) that extracts waveforms from the raw data and calculates the channel with the largest template for each cluster. The function takes the rez structure, so others should be able to use it in part or whole. Please let me know if you see anything suspect.

I think the function could be improved by reading just the portions of the binary file needed instead of the whole data file. I tired to do this using fseek, but must have gotten my offset calculation wrong or something. I am posting that bit of code below incase anybody wants to advise on how to make it work.

function out = KiloSort2SpikeStruct(rez)

% load in raw data
fid = fopen(rez.ops.fbinary, 'r');
NchanTOT = rez.ops.NchanTOT;
dat = fread(fid, [NchanTOT inf], '*int16');
fclose(fid);

% organize data with chanMap, remove unconnected channels
dat = dat(rez.ops.chanMap(rez.connected),:);
win = [-50:50];

% extract info from rez
spikeTimes     = rez.st3(:,1);
spikeClusters  = 1+rez.st3(:,5);
spikeTemplates = rez.st3(:,2);

% get raw data around spiketimes
WAVE = NaN(size(dat,1),numel(win),numel(spikeTimes));
for i = 1:length(spikeTimes)
   spkwin = spikeTimes(i) + win; 
    WAVE(:,:,i) = dat(:,spkwin);
end

% find channel index with maximum amplitude template for each cluster
peakChannel = zeros(size(spikeClusters));
uClusters = unique(spikeClusters);
for c = 1:length(uClusters)
    clust     = uClusters(c);
    I         = spikeClusters == clust;
    templates =  unique(spikeTemplates(I));

   t = squeeze(range((rez.dWU(:,:,templates)),1));
   m = max(max(t));
   if any(size(t) ==1)
       chidx = find(t == m);
   else
       [chidx, ~] = find(t == m);
   end
   peakChannel(I) = chidx;

end

out.spikeTimes    = spikeTimes;
out.spikeClusters = spikeClusters;
out.spikeWaves    = WAVE;
out.peakChannel   = peakChannel;
% DEV, to add:
% template waveforms
% channels spanned by each cluster

Here is how I attempted to read in just potions of the raw data, without success:

for i = 1:length(spikeTimes)

    spktm = spikeTimes(i);
    st  = 2* (((spktm-win(1)) * NchanTOT) - NchanTOT);
    dur = range(win);

    fseek(fid, st, 'bof');
    spkdat = fread(fid, [NchanTOT dur], '*int16');

    WAVE(:,:,i) = spkdat;

end
mattschaff commented 7 years ago

@micheleacox - Thank you! This snippet is very helpful in our efforts to visualize waveforms associated with spike times.

@marius10p - Thank you for KiloSort! Does this ex post facto approach of extracting waveforms reflect how KiloSort relates spike times with waveforms? That is, is there this direct temporal relationship, whereby a spike time represents the middle of a template-matching waveform, making a 50-sample-on-each-side window accurate? I ask because we see voltage peaks happening at different times in relation to the spike time when overlaying a cluster's waveforms using the method from @micheleacox.

mattiazanzi commented 1 year ago

@nsteinme I have a similar issue as Michelea had back in 2016. I'm using Kilosort3 to sort data from a Neuronexus 2x32 probe, and I want to import and visualize the detected waveforms in Matlab, to instantiate a broad classification into regular/fast spiking units. To do this, I am using the 'spike_times.npy' and 'spike_clusters.npy' output files from Kilosort3, to read the raw data and extract the raw waveforms directly from there. First of all: is it correct to say that the raw data file is named 'temp_wh.dat'?

Second, I found your code snippet attached to this thread, but find it hard to understand a thing. From the very first line of code,

mmf = memmapfile(params.filename, 'Format', {params.dataType, [nChInFile nSamp], 'x'});

, I understand that you should rely on the content of the 'params.py' file to get the filename, nChInFile, and nSamp information. However, if my intuition is correct, after opening the 'params.py' file the only information I get is structured this way:

dat_path = 'temp_wh.dat' n_channels_dat = 64 dtype = 'int16' offset = 0 sample_rate = 2.441462e+04 hp_filtered = True

Where can I get the 'filename, nChInFile, and nSamp' information? Thanks a lot,

Mattia

kingsEffy commented 9 months ago

Hi,

  1. The other features are the "template features", described in figure 4 of the paper. These are simply projections of each spike onto all templates. The feature values are actually in template_features.phy, which in your case would be 16 by the number of spikes. The file you pointed out contains indexing information, which is used to sparsify the encoding of these features: for a spike assigned to template N, we only want to compute the projections onto the most similar 16 other templates. For large datasets, this sparsification is crucial for saving disk space.
  2. templates.npy have unit norm across channels and timepoints. The amplitudes.npy scale these templates. Further rescaling is necessary to unwhiten the templates, by multiplying with the inverse of whitening matrix, available in whitening_mat_inv.npy. Both of these operations have already been applied in the matlab results variable: look for rez.Wraw.

The highlights are a different issue and are computed in Phy. I believe for each template, only the channels with a large enough magnitude (relative to the max channel) are colored in blue.

  1. The waveform view in Phy displays the templates, not the average waveform. templates.npy also contains the rank-3 templates, not the average waveform. If you want the true average waveforms, after processing (and while variables are still in your workspace), you could run the function gather_mean_spikes. We should add more documentation to this function, but it should just work.

Well, from Kilosort 3 there is no npy file of "template features" generated, right? Or I'm not running it correctly? Please let me know~