SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Proposal for internal handling of `time_vector` #3029

Open cwindolf opened 1 month ago

cwindolf commented 1 month ago

Following some discussions with @samuelgarcia and @JoeZiminski, I wanted to open an issue to work out the details of handling `time_vector`s in a way that won't crash everyone's computers. I'm opening a separate issue from https://github.com/SpikeInterface/spikeinterface/issues/2627, since this one will focus on implementation details and will not change the logic.

Summary of what exists:

Summary of problems:

I propose that we can still efficiently implement time vectors with two changes:

I think this could be addressed by a PR that doesn't affect any other code, and would allow time_vectors in the motion correction interpolation.
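To make the memory concern concrete, here is a rough illustration (plain numpy, not SpikeInterface internals; the sampling rate and duration are just example numbers) of what a dense `time_vector` costs for a long recording:

```python
import numpy as np

# A dense time_vector stores one float64 timestamp per sample, so the
# array alone is a large allocation for a long recording.
sampling_frequency = 30_000.0
duration_hours = 5
num_samples = int(duration_hours * 3600 * sampling_frequency)  # 540 million samples

time_vector = np.arange(num_samples, dtype="float64") / sampling_frequency
print(f"{time_vector.nbytes / 1e9:.1f} GB")  # ~4.3 GB just for the timestamps
```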

cwindolf commented 1 month ago

A PR here could try to fix https://github.com/SpikeInterface/spikeinterface/issues/2515 as well

h-mayorquin commented 1 month ago

Question: In what case do we need to call `time_to_sample_index` repeatedly?

My prior is that `np.searchsorted` is not going to be the bottleneck for anything reasonable if we accept vectors in the corresponding functions: [benchmark screenshot]

So I think the main concern is going to be memory, as you were mentioning.

Can you remind me how the time vector is used in motion interpolation?
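For reference, a fully vectorized lookup on a `time_vector` reduces to a single `np.searchsorted` call. A minimal sketch, with a hypothetical helper name, of what "accepting vectors in the corresponding functions" could look like:

```python
import numpy as np

def times_to_sample_indices(time_vector, times):
    # Hypothetical vectorized helper: map a whole batch of query times to
    # sample indices in one searchsorted call instead of scalar lookups.
    indices = np.searchsorted(time_vector, times, side="left")
    return np.clip(indices, 0, time_vector.size - 1)

# Example: 100 s of a 30 kHz recording, three query times at once
time_vector = np.arange(3_000_000) / 30_000.0
query_times = np.array([0.5, 10.0, 99.9])
print(times_to_sample_indices(time_vector, query_times))  # [  15000  300000 2997000]
```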

cwindolf commented 1 month ago

Good point! Seems like the search is not going to be a bottleneck. Even a pretty aggressive version of your benchmark is fast:

```python
In [11]: time_vector = np.arange(600_000_000.0) / 30_000

In [12]: search_times = np.linspace(500_000_000.0 / 30_000, 500_090_000.0 / 30_000, num=300)

In [13]: %%timeit
    ...: np.searchsorted(time_vector, search_times)

90.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

In motion interpolation, `sample_index_to_time()` is called to determine each sample's time in seconds, and these times are used to look up the drift for each sample. `time_to_sample_index()`, the direction that could potentially be slow, is not used at all, and as the benchmark shows it is not actually slow anyway. So @samuelgarcia, maybe there is no performance issue with having a `time_vector` in interpolation?
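To spell out the direction that is used, here is a rough sketch (hypothetical names, with `np.interp` standing in for the real drift lookup, not the actual motion-correction API) of how per-sample times feed the drift correction:

```python
import numpy as np

# Hypothetical sketch of the sample_index_to_time() -> drift direction.
# Assumed inputs (illustrative only):
#   time_vector          : per-sample times (s) of the recording segment
#   motion_temporal_bins : times (s) at which drift was estimated
#   motion_displacement  : estimated drift (um) at those bin times
def drift_for_samples(sample_indices, time_vector, motion_temporal_bins, motion_displacement):
    # With a time_vector, sample_index_to_time() is just fancy indexing ...
    sample_times = time_vector[sample_indices]
    # ... and the drift per sample is interpolated at those times.
    return np.interp(sample_times, motion_temporal_bins, motion_displacement)

# Toy example: 10 s at 30 kHz, drift estimated once per second
time_vector = np.arange(300_000) / 30_000.0
bins = np.arange(0.0, 10.0, 1.0)
displacement = np.linspace(0.0, 5.0, bins.size)
print(drift_for_samples(np.array([0, 150_000, 299_999]), time_vector, bins, displacement))
```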