A detailed guide on preprocessing

JoeZiminski commented 3 weeks ago

For ephys pipelines there are a number of quite complex preprocessing steps that need to be done in a specific order. There are also nuances to some preprocessing steps, particularly around multi-shank probes and bad channel detection. For example, CAR, IBL bad channel detection*, the highpass spatial filter, etc. make assumptions about channel location that are broken in the multi-shank context. Likewise, bad channel detection should be done before CAR and similar steps, not after.
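
To make the multi-shank point concrete, here is a minimal pure-Python sketch (not SpikeInterface's implementation) of a common median reference applied either globally or per shank. The function name `median_reference`, the toy traces, and the shank labels are all assumptions made up for illustration.

```python
from statistics import median


def median_reference(traces, groups=None):
    """Subtract the per-sample median across channels, separately within each group.

    traces: list of channels, each a list of samples of equal length.
    groups: one label per channel (e.g. a shank id); None means one global group.
    """
    if groups is None:
        groups = [0] * len(traces)
    out = [list(channel) for channel in traces]
    for label in set(groups):
        members = [i for i, g in enumerate(groups) if g == label]
        for t in range(len(traces[0])):
            ref = median(traces[i][t] for i in members)
            for i in members:
                out[i][t] = traces[i][t] - ref
    return out


# Toy data: shank 0 (three channels) carries shared noise, shank 1 (two channels) is flat.
noise = [4.0, -4.0, 2.0]
traces = [list(noise), list(noise), list(noise), [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
shank_ids = [0, 0, 0, 1, 1]

global_ref = median_reference(traces)            # one reference for all channels
per_shank = median_reference(traces, shank_ids)  # one reference per shank

print(global_ref[3])  # shank 1 channel now carries the inverted shank 0 noise
print(per_shank[3])   # shank 1 channel is left untouched
```

With a single global reference the noise from shank 0 is subtracted from the quiet channels on shank 1, so they are no longer flat. That is also one plausible reason to look for dead channels before referencing rather than after.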

In general I think these preprocessing steps are all quite mysterious if you are not familiar with them. I've been meaning for ages to write a detailed guide on preprocessing steps (the what and why for each step) but have never gotten around to it, and after speaking to people at the conference I think it would be useful. So I'll post this issue to get the ball rolling and invite contributions, maybe opening a PR next week!

*BTW I just realised this yesterday for IBL bad channel detection, maybe we should add an assert / warning?
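
A rough sketch of what such a warning could look like, using only the standard library. The helper `warn_if_multi_shank` is hypothetical and not part of SpikeInterface; the group labels are assumed to be one shank id per channel (in SpikeInterface they could presumably be taken from the recording's channel groups).

```python
import warnings


def warn_if_multi_shank(channel_groups, step_name):
    """Hypothetical guard: warn when a single-shank step is given several groups.

    Returns True if a warning was issued, False otherwise.
    """
    n_groups = len(set(channel_groups))
    if n_groups > 1:
        warnings.warn(
            f"{step_name} assumes all channels lie on one shank, but "
            f"{n_groups} groups were found; consider running it per group.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Usage: capture the warning so the outcome can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    multi = warn_if_multi_shank([0, 0, 1, 1], "bad channel detection")
    single = warn_if_multi_shank([0, 0, 0, 0], "bad channel detection")

print(multi, single, len(caught))
```

A warning rather than an assert keeps existing pipelines running while still flagging the broken assumption.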

h-mayorquin commented 3 weeks ago

This would be great!

Let's keep issues here that are about preprocessing:

https://github.com/SpikeInterface/spikeinterface/issues/2994

chiyu1203 commented 3 weeks ago

Thanks for initiating this! I have not yet been able to help with the documentation, but I would like to point out another preprocessing step that confuses me. I learned about preprocessing steps from the latest SpikeInterface_Tutorial.ipynb, where "Saving SpikeInterface objects" is introduced after "Preprocessing", so I followed those steps for my analysis. However, I read @alejoe91's 2023 paper about compressing ephys data. In the paper, they discuss the benefits (e.g. a boosted compression ratio) and drawbacks (e.g. making the compression lossy) of applying a preprocessing step (e.g. a band-pass filter) before compressing the raw data. So, is it better to compress and save the raw data directly and only do preprocessing when applying spike sorting to the data?

JoeZiminski commented 1 week ago

Hmm good question @chiyu1203 I would be interested in hearing other perspectives but I think it depends a little on your own preference, in terms of whether to store forever the 'true' raw data or a slightly-pre-processed version.

If you are using NP probes, I guess you would want to do phase_shift, bandpass_filter, then store. These are minimal preprocessing steps, so you may not regret keeping your data in this format. However, things do change and new preprocessing methods are sometimes introduced, for example the phase_shift step itself, and there are ongoing discussions on accounting for headstage filter nonlinear phase distortions (#2943). So you never know what the preprocessing pipeline might be in a few years. Maybe one day you will want to revisit the data with a more up-to-date preprocessing pipeline and wish you had the 'true' raw data. For me, I would err on the side of storing the raw data, considering that the improvement in compression you get after bandpass filtering is probably relatively minimal (figure 7 in the paper). But both approaches have merit!
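
As a toy illustration of the trade-off (not the codecs or data from the paper), the sketch below compresses a synthetic int16 trace with `zlib`, both raw and after a crude moving-average subtraction standing in for a high-pass filter. The signal, the `highpass_like` helper and the window size are assumptions chosen for illustration; the ratios are printed rather than claimed.

```python
import math
import random
import zlib
from array import array

random.seed(0)
n = 20000
# Synthetic "raw" int16 trace: slow drift plus small noise (illustrative only).
raw = array(
    "h",
    (int(2000 * math.sin(2 * math.pi * t / n)) + random.randint(-20, 20) for t in range(n)),
)


def highpass_like(samples, window=101):
    """Crude high-pass stand-in: subtract a centred moving average, round to int16."""
    half = window // 2
    out = array("h")
    for t in range(len(samples)):
        lo, hi = max(0, t - half), min(len(samples), t + half + 1)
        baseline = sum(samples[lo:hi]) / (hi - lo)
        out.append(int(round(samples[t] - baseline)))
    return out


def compression_ratio(samples):
    """Uncompressed size divided by zlib-compressed size."""
    data = samples.tobytes()
    return len(data) / len(zlib.compress(data, 9))


filtered = highpass_like(raw)

# Lossless compression of the raw trace round-trips exactly.
restored = array("h")
restored.frombytes(zlib.decompress(zlib.compress(raw.tobytes(), 9)))

print("raw ratio:     ", round(compression_ratio(raw), 2))
print("filtered ratio:", round(compression_ratio(filtered), 2))
print("raw recovered exactly:", restored == raw)
```

The raw trace comes back bit-for-bit after decompression, whereas the stored filtered trace no longer contains the drift, so the original samples cannot be rebuilt from it.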

h-mayorquin commented 1 week ago

For long-term storage and data sharing, such as on DANDI, you want the data to be as raw as possible. You don't know how future users would like to use your data, and preprocessing steps might fall out of fashion, as @JoeZiminski indicates.

For working projects you want to keep both for the sake of computational efficiency, but be sure to keep the raw data. Then, when you feel that your dataset is in such a state that you want to share it with someone other than yourself (a publication or an archive, for example), you go back to step 1.

Note that this is only valid for data whose processed version is very large. For results that are really compute-heavy or hard to reproduce, it is a good idea to share the most direct input that supports a hypothesis (spike times, for example).

JoeZiminski commented 1 week ago

(Note) There is no way to skip the highpass filter (300 Hz corner frequency) in KS4.

SpikeInterface / spikeinterface

A detailed guide on preprocessing #2996