SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Proposal for internal handling of `time_vector` #3029

Open cwindolf opened 1 month ago

cwindolf commented 1 month ago

Following some discussions with @samuelgarcia and @JoeZiminski, I wanted to open an issue to work out the details of handling `time_vector`s in a way that won't crash everyone's computers. I'm opening a separate issue from https://github.com/SpikeInterface/spikeinterface/issues/2627, since this one will focus on implementation details and will not change the logic.

Summary of what exists:

Summary of problems:

I propose that we can still efficiently implement time vectors with two changes:

I think this could be addressed by a PR that doesn't affect any other code, and would allow time_vectors in the motion correction interpolation.
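To make the memory concern concrete, here is a rough illustration (plain numpy, not SpikeInterface internals; the sampling rate and duration are just example numbers) of what a dense `time_vector` costs for a long recording:

```python
import numpy as np

# A dense time_vector stores one float64 timestamp per sample, so the
# array alone is a large allocation for a long recording.
sampling_frequency = 30_000.0
duration_hours = 5
num_samples = int(duration_hours * 3600 * sampling_frequency)  # 540 million samples

time_vector = np.arange(num_samples, dtype="float64") / sampling_frequency
print(f"{time_vector.nbytes / 1e9:.1f} GB")  # ~4.3 GB just for the timestamps
```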

cwindolf commented 1 month ago

A PR here could try to fix https://github.com/SpikeInterface/spikeinterface/issues/2515 as well

h-mayorquin commented 1 month ago

Question: In what case do we need to call `time_to_sample_index` repeatedly?

My prior is that `np.searchsorted` is not going to be the bottleneck for anything reasonable if we accept vectors in the corresponding functions: [benchmark screenshot]

So I think the main concern is going to be memory, as you were mentioning.

Can you remind me how the time vector is used in motion interpolation?
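For reference, a fully vectorized lookup on a `time_vector` reduces to a single `np.searchsorted` call. A minimal sketch, with a hypothetical helper name, of what "accepting vectors in the corresponding functions" could look like:

```python
import numpy as np

def times_to_sample_indices(time_vector, times):
    # Hypothetical vectorized helper: map a whole batch of query times to
    # sample indices in one searchsorted call instead of scalar lookups.
    indices = np.searchsorted(time_vector, times, side="left")
    return np.clip(indices, 0, time_vector.size - 1)

# Example: 100 s of a 30 kHz recording, three query times at once
time_vector = np.arange(3_000_000) / 30_000.0
query_times = np.array([0.5, 10.0, 99.9])
print(times_to_sample_indices(time_vector, query_times))  # [  15000  300000 2997000]
```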

cwindolf commented 1 month ago

Good point! Seems like the search is not going to be a bottleneck. Even a pretty aggressive version of your benchmark is fast:

```python
In [11]: time_vector = np.arange(600_000_000.0) / 30_000

In [12]: search_times = np.linspace(500_000_000.0 / 30_000, 500_090_000.0 / 30_000, num=300)

In [13]: %%timeit
    ...: np.searchsorted(time_vector, search_times)

90.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

In motion interpolation, `sample_index_to_time()` is called to determine each sample's time in seconds, and these times are used to look up the drift for each sample. `time_to_sample_index()`, the direction that could potentially be slow, is not used at all, and as the benchmark shows it is not actually slow anyway. So @samuelgarcia, maybe there is no performance issue with having a `time_vector` in interpolation?
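To spell out the direction that is used, here is a rough sketch (hypothetical names, with `np.interp` standing in for the real drift lookup, not the actual motion-correction API) of how per-sample times feed the drift correction:

```python
import numpy as np

# Hypothetical sketch of the sample_index_to_time() -> drift direction.
# Assumed inputs (illustrative only):
#   time_vector          : per-sample times (s) of the recording segment
#   motion_temporal_bins : times (s) at which drift was estimated
#   motion_displacement  : estimated drift (um) at those bin times
def drift_for_samples(sample_indices, time_vector, motion_temporal_bins, motion_displacement):
    # With a time_vector, sample_index_to_time() is just fancy indexing ...
    sample_times = time_vector[sample_indices]
    # ... and the drift per sample is interpolated at those times.
    return np.interp(sample_times, motion_temporal_bins, motion_displacement)

# Toy example: 10 s at 30 kHz, drift estimated once per second
time_vector = np.arange(300_000) / 30_000.0
bins = np.arange(0.0, 10.0, 1.0)
displacement = np.linspace(0.0, 5.0, bins.size)
print(drift_for_samples(np.array([0, 150_000, 299_999]), time_vector, bins, displacement))
```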