SortingAnalyzer export to phy show different waveforms than the native sorting

MarinManuel commented 5 days ago

Hi, I couldn't track the issue that I am having, hopefully someone else could have a better idea of what I am doing wrong.

I run kilosort2.5 on some NPX data (recorded with Open ePhys). My workflow is:

preprocess data (time shift, bandpass, CAR, remove bad channels, etc) and save the pre-processed data as zarr
run kilosort and save in folder
load the sorting from the folder, create a sortingAnalyzer, compute quality metrics
export to phy to look at the units.

I noticed that when I open the phy folder created from the sortingAnalyzer, the units appear mostly (though not exclusively) like noise. However, opening phy from the sorter_output folder created when running kilosort, the units are overwhelming good quality.

See screenshot below: Screenshot from 2024-11-14 15-12-20 The left window is from opening the sorter_output directly, with the right window is opening the version exported by sortingAnalyzer. Notice how different the traces look, and the waveforms, and the PCAs. The only things that are not affect seem to be autocorrelogram, ISIview and FRView.

I'm attaching the script I am using. run_sorter_slurm.txt

zm711 commented 5 days ago

What file are you pointing to in the params.py for the export_to_phy. in your script I don't see any filtering or referencing. KS2.5 saves a .dat file that has been filtered and referenced. Could you try to filter and reference your recording before making your sorting analyzer?

MarinManuel commented 4 days ago

Hi Zach, You're right, I forgot to include part of the code

rec = si.concatenate_recordings(recs)
rec_NPX = rec
rec_NPX = si.phase_shift(rec_NPX)
rec_NPX = si.bandpass_filter(rec_NPX, freq_min=300, freq_max=6000)
rec_NPX = si.common_reference(rec_NPX, operator="median", reference="global")
bad_channel_ids, channel_labels = si.detect_bad_channels(rec_NPX)
rec_NPX = rec_NPX.remove_channels(bad_channel_ids)
target_file = scratch_folder / 'all_sessions_probeA_pre_processed.zarr'
if target_file.exists():
    rec_NPX = si.read_zarr(target_file)
else:
    rec_NPX = rec_NPX.save(format='zarr', folder=target_file, verbose=True, **job_kwargs)

The pre-processed file is saved as zarr and then passed to the script above to run the sorter.

I open phy using:

phy template-gui all_sessions_probeA_pre_processed_kilosort2_5_sorting/sorter_output/params.py for the version created by kilosort2_5

and

for the version exported from the SortingAnalyzer phy template-gui all_sessions_probeA_pre_processed_kilosort2_5_phy/params.py

MarinManuel commented 4 days ago

this is the content of params.py from kilosort2.5:

dat_path = '/scratch2/workspace/mmanuel_uri_edu-2024-11-13/all_sessions_probeA_pre_processed_kilosort2_5_sorting/sorter_output/temp_wh.dat'
n_channels_dat = 384
dtype = 'int16'
offset = 0
sample_rate = 30000.
hp_filtered = True

from export_to_phy:

dat_path = r'/scratch2/workspace/mmanuel_uri_edu-2024-11-13/all_sessions_probeA_pre_processed_kilosort2_5_phy/recording.dat'
n_channels_dat = 384
dtype = 'int16'
offset = 0
sample_rate = 30000.0
hp_filtered = True

alejoe91 commented 4 days ago

Hi @MarinManuel

Can you check which Kilosort version you're using?

You should make sure you checkout v2.5.2 and v3.0.2. In the *.0.1 versions, there was a misalignment issue that got fixed.

MarinManuel commented 3 days ago

Hi Alessio, I confirm that I am using v2.5.2 from https://github.com/MouseLand/Kilosort/releases/tag/v2.5.2

MarinManuel commented 2 days ago

I'm pretty sure I've got the same problem using Kilosort4. The output from the SortingAnalyzer's export_to_phy is full of noisy units

I'm sure I should get much better than that, although I cannot open the params.py from the sorter_output folder because of this error https://github.com/cortex-lab/phy/issues/1273 . What is the easiest way to convert my pre-processed zarr file into a format that can be read by phy?

alejoe91 commented 2 days ago

@MarinManuel do the templates look correct when you use the sw.plot_unit_templates function?

MarinManuel commented 1 day ago

I guess maybe I misunderstand how the SortingAnalyzer is supposed to be used. I had assumed the unit ids would stay the same, but that the analyzer would compute a number of other metrics, but maybe that's incorrect?

Here is unit 2 straight from Kilosort 2.5 (not a great one)

Here is unit 2 from the SortingAnalyzer's export_to_phy

and here are the templates using sw.plot_unit_templates(analyzer, unit_ids=[2]) which look like the ones shown in phy

I also looked at the units with the largest number of spikes. This is unit 159, with 475745 spikes (the second largest, the first one was noise)

But when I open the SortingAnalyzer version, the unit with the second to largest number of units has 475733 spikes and is number 138 but the template that seem to match using sw.plot_unit_templates seem to be number 159

alejoe91 commented 1 day ago

Ah the problem is that Phy needs linearly increasing integers as ids..you can see there is a "si unit id" column in Phy that corresponds to the unit id in spikeinterface :)

MarinManuel commented 1 day ago

Ok, I see that unit 139 in phy has number 159 in the column si_unit_id so that makes sense. I still don't understand why the number of spikes, the traceview and the templates are different between the two versions

alejoe91 commented 1 day ago

Did you do any curation step? Like removing duplicated spikes??

MarinManuel commented 1 day ago

hmmm yes, my script does

sorting = sorting.remove_empty_units()
sorting = si.remove_duplicated_spikes(sorting, censored_period_ms=0.3, method="keep_first_iterative")
sorting = si.remove_excess_spikes(sorting=sorting, recording=rec)
sorting = si.remove_redundant_units(sorting, align=False, remove_strategy="max_spikes")

before creating the SortingAnalyzer, so that makes sense, thanks.

What about the TraceView? Here are the first 250ms of the file, which looks nothing alike after the SortingAnalyzer

alejoe91 commented 1 day ago

The sorting analyzer will use the preprocessed recording, while kilosort uses the original dat file (without any preprocessing). That should explain the difference!

MarinManuel commented 23 hours ago

I think I understand what you mean, but where do all those units come from after the SortingAnalyzer? if they were not there in the output of Kilosort? Or is it just that the templates are present over a larger number of channels after SortingAnalyzer, so there seem to be more of them?

alejoe91 commented 13 hours ago

you don't have more units :) Just the IDs might be different.

You can check as follows:

print(len(sorting_analyzer.unit_ids))

# read the kilosort folder
sorting_ks = se.read_kilosort("path-to/sorter_output")
print(len(sorting_ks.unit_ids))

MarinManuel commented 9 hours ago

Thank you Alessio, I think I am starting to realize where my assumptions were wrong. You have to admit, though, that the TimeViews look very different. I mean in terms of the colored spikes which appear much denser after the SortingAnalyzer

alejoe91 commented 9 hours ago

It's very hard to say from the plot, but yeah the trace views should look different ;) Can we close the issue?

SpikeInterface / spikeinterface

SortingAnalyzer export to phy show different waveforms than the native sorting #3544