SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Functionality to merge recordings after splitting #2033

Closed JoeZiminski closed 6 months ago

JoeZiminski commented 1 year ago

I would like to provide an option to split recordings before preprocessing so that recordings can be preprocessed per shank. There are alternatives (e.g. the groups argument in common_reference), but this is not available for every step (e.g. spatial filtering). The problem I am having is that after sorting by group, I had planned to aggregate the sorting results. However, the preprocessed data is still split by group, so I cannot figure out how to handle the inputs for waveform extraction: the sorting results can be aggregated, but the preprocessed recordings remain split.

One solution would be to re-merge the split, preprocessed (i.e. written to binary) recordings. Is this possible in SI? I would be happy to write a method for this, although I can see why you might not want to include this functionality, as it is a little confusing. For example, a user might split the recording, perform lazy preprocessing and then merge, but when the preprocessing is actually performed it is no longer split by group. Nonetheless, this could be handled with documentation / warnings or other checks.

alejoe91 commented 1 year ago

Hi @JoeZiminski

Yes, both are possible! You should be able to use the aggregate_channels and aggregate_units functions to merge recordings and sortings, respectively. https://spikeinterface.readthedocs.io/en/latest/modules/core.html#manipulating-objects-slicing-aggregating

Basically, you can simply do:

recording_aggr = si.aggregate_channels(list_of_split_recordings)
sorting_aggr = si.aggregate_units(list_of_split_sortings)

# optional but convenient:
sorting_aggr.register_recording(recording_aggr)

we = si.extract_waveforms(recording_aggr, sorting_aggr, ...)

Let me know if that's what you needed!

alejoe91 commented 1 year ago

I also think that it would be nice to have a GroupByRecording preprocessor, so that all the preprocessing by group (or property) and the re-aggregation are done automatically. What do you think?

An example would look like this:

recording_np2_4shanks = se.read_spikeglx(...)

recording_group_by = spre.group_by(recording_np2_4shanks, property="group")

recording_cmr = spre.common_reference(recording_group_by)

We could make a mechanism (even at the base preprocessor) to split->apply->merge traces!
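
The split->apply->merge idea can be sketched outside of SpikeInterface with plain NumPy (the function name and the common-median example below are purely illustrative, not SpikeInterface API):

```python
import numpy as np

def split_apply_merge(traces, groups, func):
    """Apply ``func`` independently to each channel group, then
    reassemble the columns in their original order."""
    out = np.empty_like(traces, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[:, mask] = func(traces[:, mask])
    return out

rng = np.random.default_rng(0)
traces = rng.normal(size=(100, 4))       # samples x channels
groups = np.array([0, 0, 1, 1])          # e.g. a per-shank "group" property

# per-group common-median reference as the example operation
cmr = split_apply_merge(
    traces, groups,
    lambda x: x - np.median(x, axis=1, keepdims=True),
)
```

The same pattern would apply to any per-group operation; the merge step just has to preserve the original channel order.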

@samuelgarcia @h-mayorquin what's your opinion?

alejoe91 commented 1 year ago

It could even be in core, and just have a fancy get_traces implementation :)

JoeZiminski commented 1 year ago

Thanks @alejoe91! I completely missed that. I think your suggestion would be really nice: it would be very convenient and reduce the likelihood of people trying this themselves and getting into trouble with the lazy aspect mentioned above, e.g. (I think this would not work?):

split_recording_dict = recording.split_by("group")

new_dict = {}
for key, rec in split_recording_dict.items():
    new_dict[key] = bandpass_filter(rec)

recording = aggregate_channels(split_recording_dict)

data_not_preprocessed_by_group = recording.get_traces(...)

alejoe91 commented 1 year ago

@JoeZiminski what you suggest would work :)

JoeZiminski commented 12 months ago

Thanks Alessio! Sorry, I also noticed an error in my example above: the new_dict should have been aggregated, not the split_recording_dict! To be 100% sure: in the (clearer) example below, at the get_traces call, would bandpass_filter be processed as a whole and common_reference by group?

bandpass_recording = bandpass_filter(recording)

split_bandpass_recording_dict = bandpass_recording.split_by("group")

new_dict = {}
for key, rec in split_bandpass_recording_dict.items():
    new_dict[key] = common_reference(rec)

preprocessed_recording = aggregate_channels(new_dict)

data = preprocessed_recording.get_traces()

BTW will start working on a GroupByRecording PR

alejoe91 commented 12 months ago

Hi @JoeZiminski

The only thing that you need to change in your snippet is that aggregate_channels needs a list as input:

recording_list = []
for key, rec in split_bandpass_recording_dict.items():
    recording_list.append(common_reference(rec))

preprocessed_recording = aggregate_channels(recording_list)

BTW will start working on a GroupByRecording PR

Sounds great! :)

JoeZiminski commented 12 months ago

Thanks @alejoe91, that's amazing, it works! Very flexible.

JoeZiminski commented 12 months ago

Hey @alejoe91 looking into this a bit further I guess there are a few ways it could operate. I wonder now if it might be easier to add this as a high-level argument to get_traces() (e.g. split_by_property="xxx"). But maybe I am missing another obvious route! The ways I thought of were:

1) A GroupByRecording object that, in its get_traces() method, splits the recording, calls get_traces() on the parent recording object, then merges. This might be a bit confusing, as it would only operate on preprocessing steps before itself.

e.g.

bp_recording = bandpass_filter(recording)
split_preprocessed_recording = group_by_recording(bp_recording)
cr_recording = common_reference(split_preprocessed_recording)

would only split bandpass but not common reference.

2) Have BasePreprocessor check the recording metadata for a GroupByRecording step and, if it exists, perform splitting and merging at a high level. The downside of this is that no matter where group_by_recording is applied, all preprocessing steps will be performed after splitting, which could be confusing.

3) Implement this as an argument directly on BasePreprocessor rather than through a GroupByRecording object. This would perform all preprocessing split by property.

h-mayorquin commented 12 months ago

@alejoe91 I missed that you @-mentioned me above a while ago.

We could make a mechanism (even at the base preprocessor) to split->apply->merge traces!

This is an interesting design problem. My two cents:

Given that everything at the recording level is lazy would not something like this suffice? It could even be a function:

from spikeinterface.preprocessing.basepreprocessor import BasePreprocessor
from spikeinterface.core.baserecording import BaseRecording
from spikeinterface.core.channelsaggregationrecording import ChannelsAggregationRecording

class GroupByPreprocessing(ChannelsAggregationRecording):

    def __init__(self, recording: BaseRecording, property_to_split_by: str, preprocessing_list: list[BasePreprocessor]):
        split_recording_dict = recording.split_by(property_to_split_by)
        recordings_to_aggregate = []
        for split_recording in split_recording_dict.values():
            # Pre-processing as a pipeline
            preprocessed = split_recording
            for preprocessor in preprocessing_list:
                preprocessed = preprocessor(preprocessed)

            recordings_to_aggregate.append(preprocessed)

        # Aggregating the recordings
        super().__init__(recording_list=recordings_to_aggregate)

The route of having a fancier get_traces would imply that downstream analysis has to be aware that this object has a different signature for get_traces, which I think will bring difficulties. Better to coerce the pipeline so it behaves as yet another BaseRecording object; then it can be used everywhere else.
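
As a plain-Python illustration of the pipeline pattern in the class above (all names here are hypothetical, and NumPy arrays stand in for recording objects), running a list of callables per group and re-aggregating could look like:

```python
import numpy as np

def group_by_pipeline(traces, groups, pipeline):
    """Split a samples-x-channels array by group, run each callable in
    ``pipeline`` on every split in order, then re-aggregate the columns."""
    merged = np.empty_like(traces, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        chunk = traces[:, mask].astype(float)
        for step in pipeline:   # "Pre-processing as a pipeline"
            chunk = step(chunk)
        merged[:, mask] = chunk
    return merged

traces = np.arange(12.0).reshape(3, 4)
groups = np.array([0, 0, 1, 1])
pipeline = [
    lambda x: x - x.mean(axis=1, keepdims=True),  # per-group referencing
    lambda x: x * 2.0,                            # e.g. a gain step
]
out = group_by_pipeline(traces, groups, pipeline)
# out is [[-1, 1, -1, 1]] on every row: each group's pair of channels is
# mean-referenced within the group, then scaled
```

In the real class, each pipeline element would be a lazy BasePreprocessor, so nothing is computed until get_traces is called on the aggregated object.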

alejoe91 commented 12 months ago

Thanks @h-mayorquin

I agree that designing something at base would be hard. @JoeZiminski what do you think of Ramon's proposal? We could also propagate preprocessing_params as an input argument

JoeZiminski commented 11 months ago

Thanks @h-mayorquin @alejoe91! I think this looks like a nice pattern and removes any complications from playing with base. I had a proper look through aggregate_recordings and understand it better now. To check my understanding, the above class could be used like this (assuming preprocessing_list and preprocessing_params are read off the recording in this case):

shift_recording = phase_shift(recording)
bp_recording = bandpass_filter(shift_recording)
grouped_recording = group_by_preprocessing(bp_recording, property_to_split_by="group")  # GroupByPreprocessing
# now calling get_traces() on grouped_recording, data would be preprocessed per group

# adding another step
reference_recording = common_reference(grouped_recording)
# in this case, calling get_traces() on reference_recording, phase_shift and bandpass
# would be preprocessed per group, then referencing would be applied across all groups?

If what I have said above is correct, the naming / documentation just needs to make clear that every step before group_by_recording is applied will be grouped. My understanding is that the operation will have to group the preprocessing steps either before or after it is applied (unless it automatically groups everything wherever it is placed in the preprocessing chain, which would be confusing). Grouping the steps before it feels intuitive (and makes it a nice step to apply at the end of preprocessing to group everything).
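
The boundary semantics described above can be mimicked with a small NumPy toy (illustrative only, not SpikeInterface API): steps applied before the grouping boundary run per group, while steps applied afterwards see the re-aggregated data as a whole:

```python
import numpy as np

def per_group(traces, groups, func):
    """Apply ``func`` to each group's channels independently."""
    out = traces.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        out[:, mask] = func(out[:, mask])
    return out

traces = np.arange(8.0).reshape(2, 4)
groups = np.array([0, 0, 1, 1])

# steps before the grouping boundary run per group...
grouped = per_group(traces, groups, lambda x: x - x.mean(axis=1, keepdims=True))

# ...whereas a step added after the boundary operates on the
# re-aggregated recording, i.e. across all channels at once
referenced = grouped - np.median(grouped, axis=1, keepdims=True)
```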

alejoe91 commented 11 months ago

Yep! I agree that once you group, the splitting/re-aggregating pattern will only affect the steps prior to grouping! Everything that comes after will be applied regardless of grouping.

I like the idea of automatically grabbing preprocessing steps and parameters from an existing object, but we have to be careful here, due to the flexibility of the API.

A corner and complex case could be the following:

shift_recording = phase_shift(recording)
bp_recording = bandpass_filter(shift_recording)
grouped_recording = group_by_preprocessing(bp_recording, property_to_split_by="group")
reference_recording = common_reference(grouped_recording)
# what to do here???
grouped_recording_second = group_by_preprocessing(reference_recording, property_to_split_by="group")

JoeZiminski commented 11 months ago

Good point! Maybe in this case, if a group_by_preprocessing step is already in the preprocessing list, an error could be raised? My feeling is this provides a nice convenience function to preprocess an entire preprocessing chain by group, but if people want to do very complex things, they could use the lower-level API?

JoeZiminski commented 11 months ago

Following #2299, it was determined that this was not simple to implement in a robust way, so further documentation on the existing API will be added in a new PR.

JoeZiminski commented 6 months ago

Some extra docs were added in #2316. I'll close this now.