SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Add PreprocessingPipeline #3438

Open chrishalcrow opened 2 months ago

chrishalcrow commented 2 months ago

A proposal to add a PreprocessingPipeline class, which contains ordered preprocessing steps and their kwargs in a dictionary.

You can apply a pipeline instance to a recording, or use the helper function create_preprocessed to make a preprocessed recording:

preprocessor_dict = {'bandpass_filter': {'freq_max': 3000}, 'common_reference': {}}

# apply by constructing a pipeline explicitly...
from spikeinterface.preprocessing import PreprocessingPipeline
pipeline = PreprocessingPipeline(preprocessor_dict)
preprocessed_recording = pipeline.apply_to(recording)

# ...or in one call with the helper function
from spikeinterface.preprocessing import create_preprocessed
preprocessed_recording = create_preprocessed(recording, preprocessor_dict)
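
For comparison, the dict above encodes the same ordered chain you would currently write by hand with the existing preprocessing functions:

from spikeinterface.preprocessing import bandpass_filter, common_reference

# equivalent manual chain: steps run in the order they appear in the dict
rec = bandpass_filter(recording, freq_max=3000)
rec = common_reference(rec)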

Also adds a function which takes a recording.json provenance file and makes a preprocessor_dict:

from spikeinterface.preprocessing import get_preprocessing_dict_from_json
my_dict = get_preprocessing_dict_from_json('/path/to/recording.json')

This allows for some cool things:

  1. Users can pass a single dictionary to construct a preprocessed recording (as above). Hence it completes the “dictionary workflow”, since you can already use dicts in sorting (run_sorter_jobs) and in postprocessing (compute).
  2. Users can easily visualise their preprocessing pipeline using the repr, including an HTML repr in Jupyter notebooks (I made a hideous one, but we can aim for something like the sklearn pipeline repr: see https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html).
  3. Increases portability between labs, since you can reconstruct the preprocessing steps from the recording.json file without the original recording (and without worrying about paths); see the sketch after this list.
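
A sketch of 3. (the exact provenance mechanism isn’t fixed here; I’m assuming dump_to_json writes the recording.json, and other_recording is a hypothetical second recording):

from spikeinterface.preprocessing import create_preprocessed, get_preprocessing_dict_from_json

# lab A: preprocess and dump the provenance to json
preprocessed_recording = create_preprocessed(recording, preprocessor_dict)
preprocessed_recording.dump_to_json('/path/to/recording.json')

# lab B: rebuild the pipeline from the json alone and apply it to a
# different recording, without lab A's raw data or file paths
my_dict = get_preprocessing_dict_from_json('/path/to/recording.json')
other_preprocessed = create_preprocessed(other_recording, my_dict)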

Note that 3. only works for preprocessing steps that are in some sense “global”, i.e. ones that can be applied to any recording. This isn’t true of all preprocessing steps: e.g. interpolate_bad_channels needs bad_channel_ids, which are recording dependent. However, many of these functions could be modified to apply more globally: e.g. if bad_channel_ids is None, interpolate_bad_channels could detect the bad channels itself, then interpolate them. That would be applicable to any recording, so is “global”.
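
Something like this minimal sketch (interpolate_bad_channels_global is a hypothetical name, not part of this PR; it leans on the existing detect_bad_channels):

from spikeinterface.preprocessing import detect_bad_channels, interpolate_bad_channels

def interpolate_bad_channels_global(recording, bad_channel_ids=None, **kwargs):
    # hypothetical wrapper: when no ids are given, detect them first,
    # so the step can be applied to any recording ("global")
    if bad_channel_ids is None:
        bad_channel_ids, channel_labels = detect_bad_channels(recording)
    return interpolate_bad_channels(recording, bad_channel_ids, **kwargs)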

No rush on this, and I’m not 100% set on it being implemented. It’s important to get the names right. I read this: https://melevir.medium.com/python-functions-naming-tips-376f12549f9. I think it’s important that create_preprocessed doesn’t sound in-place, after the number of problems with set_probe. Hence I’m against something like apply_preprocessing(recording), and would rather have make, create, construct, produce or similar in the function name. I also like the idea (from the article) that you don’t need to include e.g. recording in the name if recording is a required argument. Hence I prefer my_pipeline.apply_to(recording) over something like my_pipeline.apply_pipeline_to_recording(recording).

To do: