SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Writing phy data resets manual curation #3247

Closed RikkelBob closed 1 week ago

RikkelBob commented 1 month ago

First of all, spikeinterface is amazing and it's helped me a lot. My pipeline is nearly finished, except for one issue. I sorted my data with mountainsort5, and manually split/merged clusters in phy. After merging/splitting, I want to recalculate some quality metrics, waveforms, pca, etc. However, if I load my sorting and write a new phy folder, all the labeling (good, mua, noise) and merging/splitting is gone. The snippet below reproduces this behavior.

Am I wrong to assume that read_phy reads the manually curated data? If so, is there a method to do so? Or is the problem with export_to_phy? I'm using SI 0.98.1.

from pathlib import Path

import numpy as np
from spikeinterface.exporters import export_to_phy
from spikeinterface.extractors import read_phy

import spikeinterface as si

phy_dir = r"redacted"
rec_folder = r"redacted"
sorting = read_phy(phy_dir)
ac_dir_phy = Path(phy_dir, "test")

fs = 32000

#Get channel locations from .npy file
path_channel_location = Path(phy_dir, 'channel_positions.npy')
channel_locations = np.load(path_channel_location)
num_channels = channel_locations.shape[0]
ac_dir_wave = Path(phy_dir, "test", "waveforms")

#Load the preprocessed recording
recording = si.read_binary(Path(rec_folder, "recording.dat"),
                           sampling_frequency=fs,
                           num_channels=num_channels,
                           dtype="int16")

recording.set_dummy_probe_from_locations(channel_locations)
recording.annotate(is_filtered=True)

we_sub = si.extract_waveforms(recording, sorting, ac_dir_wave, sparse=True,
                              dtype='int16', return_scaled=False, overwrite=True)

export_to_phy(we_sub, output_folder=Path(ac_dir_phy, 'phy'), copy_binary=False)

zm711 commented 1 month ago

Hey @RikkelBob,

We are so happy that you appreciate the project :)

We have added some updates to allow for better propagation of these labels back into phy in more recent versions (0.101.0). Unfortunately my mind is a little foggy on whether there is a hack we could do in 0.98.x that would allow you to do the same. The more annoying hack would be to do this outside of phy. I think @jakeswann1 had suggested a way to do this with pandas, so if he has a moment maybe he can give you his trick for doing this.

If you do decide to update, note that there is one API change (WaveformExtractor -> SortingAnalyzer). We have a translation layer to bring all your old data objects into the new data model if you want to use the new export features, so I'll include the documentation here.

jakeswann1 commented 1 month ago

I'm using 0.101, so I'm not sure whether the same applies to your version, but in my experience read_phy and read_kilosort retain the splitting and merging done in phy, as well as the assigned labels. My suspicion is that export_to_phy is the problem here, but I can't say for sure.

If you want to export some computed metrics to phy for further curation, you can do this by exporting them to a .tsv file in the phy folder, with the cluster ID as the index label and your metrics in the other columns. You can look at cluster_info.tsv or cluster_group.tsv to see the format.
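A rough sketch of that .tsv export with pandas. The metric names and cluster IDs below are made up, and the `cluster_id` index label is an assumption based on phy's usual per-cluster file layout:

```python
import pandas as pd

# Hypothetical metrics for three clusters; in practice these would come
# from spikeinterface's quality-metric computation.
metrics = pd.DataFrame(
    {"snr": [5.2, 1.1, 3.4], "isi_violations_ratio": [0.01, 0.20, 0.05]},
    index=pd.Index([0, 1, 2], name="cluster_id"),
)

# phy picks up cluster_*.tsv files in the sorting folder and shows the
# extra columns in its cluster view.
metrics.to_csv("cluster_metrics.tsv", sep="\t")
```

Dropping the file into the phy folder is enough; no phy-side configuration is needed for it to appear.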

In 0.101.0, quality metrics are exported by default; this is discussed further in #3112.

Hope this helps!

zm711 commented 1 month ago

Right, you were doing QC stuff. In your case, @RikkelBob, you would need to take your labels and add them to cluster_info.tsv, but definitely read the discussion Jake linked. The general idea would be to make a pandas Series and turn it into a column in cluster_info.tsv. We added a way to propagate this automatically in the export_to_phy function in 0.101.0.
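A minimal sketch of that pandas trick, assuming a cluster_info.tsv keyed on `cluster_id`; the toy DataFrame and labels below stand in for the real file and your real curation:

```python
import pandas as pd

# Toy stand-in for an existing cluster_info.tsv; in practice you would
# read the real file instead:
#   info = pd.read_csv("cluster_info.tsv", sep="\t")
info = pd.DataFrame({"cluster_id": [0, 1, 2], "ch": [3, 7, 12]})

# Manual curation labels as a pandas Series keyed on cluster ID
labels = pd.Series({0: "good", 1: "mua", 2: "noise"})

# Map the labels onto each row's cluster_id and write the file back out
info["group"] = info["cluster_id"].map(labels)
info.to_csv("cluster_info.tsv", sep="\t", index=False)
```

Clusters with no entry in the Series would simply get an empty cell, which phy tolerates.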

RikkelBob commented 2 weeks ago

Thanks for all the feedback! By updating to 0.101 and adding additional_properties=['quality'] to export_to_phy, cluster_quality.tsv is created. This adds a column for quality to phy. This is different from how phy usually handles quality (e.g., by coloring the clusters green, grey, or dark grey). That could be resolved by adding quality to cluster_info.tsv; however, I did not find a way to automatically add this property to cluster_info.tsv. In fact, cluster_info.tsv is not generated by exporting to phy, but only after manually editing clusters in phy and then saving. I may be overlooking something obvious, but automatic generation of cluster_info.tsv would be ideal. If there is a way to do so, I'd love to hear it.

zm711 commented 2 weeks ago

I think the design choice was that phy will read in any .tsv file and display it. So we can safely add extra info into phy without risk of interfering with phy's behavior, but by writing that specific file we would need to be careful not to clobber the info phy itself maintains. Is there another reason why we would not want to write directly to cluster_info.tsv, @alejoe91?

jakeswann1 commented 2 weeks ago

A possible issue is that if you split/merge clusters in phy, cluster_info.tsv is updated automatically with the basic fields, plus any columns added by plugins; these are recalculated for the new clusters. When you export_to_phy from spikeinterface, you are exporting a static sorting, so there is no way to generate new labels for clusters created later in phy. You could obviously leave these blank when a new row is created, but that's not super useful when the main advantage of phy over sortingview or the spikeinterface GUI is the splitting & merging capability (for now 👀)

zm711 commented 2 weeks ago

I remember patching an issue with post-split handling of cluster_info.tsv in our read_phy function, so I think messing with phy's own files can be a little tricky in general, as @jakeswann1 points out. Do you like the coloring just for ease of viewing? What value do you get out of the colors vs. just filtering on the column label in the GUI?

RikkelBob commented 1 week ago

@zm711 for me, it's mostly an ease-of-use thing when considering using the generated data outside of SI. I previously read cluster_info.tsv for all cluster properties, but reading the .tsv files separately is easily implemented so it's fine.
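Reading the separate .tsv files can indeed be done in a few lines. A hedged sketch, where the folder name and the toy files are stand-ins for a real phy export (phy's per-cluster files share a `cluster_id` column, which is what the merge assumes):

```python
from pathlib import Path

import pandas as pd

# Stand-in for a real phy export folder, populated with two toy files
# mimicking phy's per-cluster .tsv layout.
phy_dir = Path("phy_out_demo")
phy_dir.mkdir(exist_ok=True)
pd.DataFrame({"cluster_id": [0, 1], "group": ["good", "mua"]}).to_csv(
    phy_dir / "cluster_group.tsv", sep="\t", index=False)
pd.DataFrame({"cluster_id": [0, 1], "snr": [5.2, 1.1]}).to_csv(
    phy_dir / "cluster_quality.tsv", sep="\t", index=False)

# Merge every per-cluster .tsv into one table keyed on cluster_id,
# roughly reconstructing what cluster_info.tsv would hold.
tables = [pd.read_csv(p, sep="\t") for p in sorted(phy_dir.glob("cluster_*.tsv"))]
info = tables[0]
for t in tables[1:]:
    info = info.merge(t, on="cluster_id", how="outer")
```

An outer merge keeps clusters that appear in only some of the files, leaving the other columns empty.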

One thing I would possibly change is to save unit quality by default when using export_to_phy, instead of having to pass additional_properties=['quality'] as an argument.

Anyway, I think we can consider this as resolved. Thanks for your help!