SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License
495 stars 187 forks source link

Question about read spikeGLX #2177

Closed ZoeChen96 closed 10 months ago

ZoeChen96 commented 10 months ago

Dear, My cooperating engineer shares me a dataset, which seems to be a spikeGLX (from the file ABOUT THIS DATASET.txt that I will attach below, with the structure:

chen14@medic05.imec.be> cd set1/
chen14@medic05.imec.be> ls
'ABOUT THIS DATASET.txt'              curated_ST      set1_siteCoords.txt
 SC026_080619_g0_tcat.imec2.ap.meta   set1_data.bin

He used to read the .bin with MATLAB and successfully run his sorter, while I want to use Spikeinterface if possible. I tried simply like this:

import spikeinterface.extractors as se

folder = "/imec/other/macaw/projectdata/neuropixels_datasets/set1/"
rec = se.read_spikeglx(folder_path=folder)

But returns the error below. Could you please help me have a look?

{
    "name": "IndexError",
    "message": "list index out of range",
    "stack": "---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/imec/other/macaw/chen14/pysimulator/maindevelop.ipynb Cell 17 line 1
----> <a href='vscode-notebook-cell://ssh-remote%2Bchen2/imec/other/macaw/chen14/pysimulator/maindevelop.ipynb#X22sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a> rec = se.read_spikeglx(folder_path=folder)

File /imec/other/macaw/chen14/anaconda43/envs/tfqu1016/lib/python3.8/site-packages/spikeinterface/extractors/neoextractors/spikeglx.py:51, in SpikeGLXRecordingExtractor.__init__(self, folder_path, load_sync_channel, stream_id, stream_name, all_annotations)
     49 def __init__(self, folder_path, load_sync_channel=False, stream_id=None, stream_name=None, all_annotations=False):
     50     neo_kwargs = self.map_to_neo_kwargs(folder_path, load_sync_channel=load_sync_channel)
---> 51     NeoBaseRecordingExtractor.__init__(
     52         self, stream_id=stream_id, stream_name=stream_name, all_annotations=all_annotations, **neo_kwargs
     53     )
     55     # open the corresponding stream probe for LF and AP
     56     # if load_sync_channel=False
     57     if \"nidq\" not in self.stream_id and not load_sync_channel:

File /imec/other/macaw/chen14/anaconda43/envs/tfqu1016/lib/python3.8/site-packages/spikeinterface/extractors/neoextractors/neobaseextractor.py:210, in NeoBaseRecordingExtractor.__init__(self, stream_id, stream_name, block_index, all_annotations, use_names_as_ids, **neo_kwargs)
    205         raise ValueError(
    206             f\"This reader have several streams: \
Names: {stream_names}\
IDs: {stream_ids}. \"
    207             f\"Specify it with the 'stream_name' or 'stream_id' arguments\"
    208         )
    209     else:
--> 210         stream_id = stream_ids[0]
    211         stream_name = stream_names[0]
    212 else:

IndexError: list index out of range"
}
ZoeChen96 commented 10 months ago

this is the ABOUT THIS DATASET.txt file:

This recording was done by Susu Chen, as part of the MAP project in Karel Svoboda's lab.

The recording is a NP 1.0 probe in mouse thalamus, in a task that involves licking -- the large artifacts are licking events.

The only preprocssing of the binary files is concatenation of 80 ~7 sec trials using CatGT. Consult SC026_080619_g0_tcat.imec2.ap.meta to see the CatGT command line (which includes no filtering, and excludes the tshift operation). Consult the README for CatGT (https://billkarsh.github.io/SpikeGLX/#catgt) for more information about content of the other files.

The trials recordings were overlapped in time, so the binary data are the same as a continuous recording.

The post-curation spike table is in curated_ST. This should be used in conjunction with the _QM.txt file, which has curator calls for each unit identified. Only units identified as 'single' or 'ok' are likely to be reproduced in a fresh sort of the data. 

The attached binary is the first 536 seconds of a much longer recording. The spike table and quality metrics are from the whole recording. A great online sorter would find 'most' of the spikes in 'single' and 'ok' units in the spike table up to 530 sec (sample 15,900,000).

The binary files can be conveniently viewed using the SpikeGLX viewer. On a machine with no recording hardware, download SpikeGLX and run SpikeGLX_NISIM.exe.
alejoe91 commented 10 months ago

Hi, the SpikeGLX reader assumes the data has been generated by SpikeGLX without file naming modifications. It appears that the .bin file here doesn't have the original name.

Can you try to rename it to: SC026_080619_g0_tcat.imec2.ap.bin?

ZoeChen96 commented 10 months ago

Yes, it works! Thanks a lot! A quick follow up question: do you know how can I read the ground truth in curated_GT/ ? is that also a standard sorted object format? it's like this: (sorted with kilosort and then curated in JRclust interface) image. The .txt file is attached below

ZoeChen96 commented 10 months ago
This is the information:
These recordings were all done by Susu Chen, while working Karel Svoboda's lab at Janelia research  campus. They form part of the 'MAP' project, looking at the involvement of many brain regions in mice performing a short term memory task. The trials are ~7 seconds long (and corresponding patterns can be seen in the data).

Processing before curation:
CatGT, version 2.0 (filter = biquad, no tshift)
-lf -aphipass=300 -aplopass=9000 -lfhipass=0.1 -lflopass=300 -gbldmx -gfix=0,0.09,0.02 

The data was sorted with Kilosort 2.0, and curated using the JRClust interface. Attached quality metrics were calculated using the JRClust code. 

The curation protocol was developed by Susu Chen and Jennifer Colonell, based on Susu's curation expertise. The protocol was refined in collaboration with 3 'production curators' in Janelia's Project Technical Resources group: Claire Managan, Rebecca Arruda, and Ramya Kappangantula. A copy of the curation protocol is attached is included.

The results of the curation are in the contained in 2 files:

<run name>_ST.txt
<run_name>_QM.txt

*_ST.txt is the spike table, 5 columns, tab delimited:

spike time in samples
cluster labe1, 1-nUnit (one based from JRClust, roughly in position order)
X coordinate of the spike, rounded to nearest um (from center of mass calculation in JRClust)
Y coordinate of the spike, rounded to nearest um (from center of mass calculation in JRClust)
spike amplitude in uV

*_QM.txt is a table of metrics about each unit. The column labels are mostly self explanatory, but a few notes:

unitSNR is calculated by JRClust, Vpp/estimated noise on the peak site
presence ratio is as defined in the ecephys pipeline

ISI metrics are calculated using a refractory period that is brain region dependent, read from the curation parameters, and reported in the column name. Duplicates (spikes within 0.0166
numViolations
ISI_Ratio = firing rate within refractory period/firing rate in 20 ms period about each spike (JRClust ISI metric)
frac_FP = fraction of false positives estiamted according the the Hill formula (see help for ecephys pipeline)

maxTemplateSim is the maximum similarity score between the template for this unit and all other units. Read from KS output.
maWaveformSim is the maximum similarity score between the average wavefrom for this unit and all other units. Read from JRClust output.

isoDist and LRatio are output from JRClust. The definition matches that in the ecephys docs, but the features are found differntly.

note is the curator's call for this unit
callByMetrics is generated by the code that makes the these tables from the JRClust output, applying simple thresholds to metrics. nsg => 'not so good'
xCoord and y Coord of the unit, from c.o.m. calculation in JRClust

R0_est, adjR2, and rmse are results from an attempted fit of the amplitude falloff with distance from the center peak. This calculation is not very valuable for NP1.0 probes. For estimates of spread (which correlates with distance of the unit from the probe), use zFoot = total number of rows with signal > 50 uV, or zFWHM.

spike_duration, spike_fwhm, spike_trough are features of the mean waveform for the unit, calculated as described in the documentation for ecephys.
alejoe91 commented 10 months ago

No that's a txt file. You can parse it with pandas or other external tools ;)

Glad that renaming worked. Closing!