SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Unify slicing behavior in SpikeInterface #1989

Open alejoe91 opened 1 year ago

alejoe91 commented 1 year ago

Related to #1881, #1886 and #1979.

Let's discuss how we should handle the behavior of start_frame/end_frame/channel_ids in get_traces(), and how to add generalized testing once we reach consensus.

Some proposals:

  1. Be restrictive:
    • end_frame > start_frame: passing start_frame == end_frame will raise an Exception
    • end_frame < num_samples
  2. Numpy-like

For the channel_ids the behavior is already established, but we should consider adding tests for it when we add generalized testing.
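A minimal sketch of how the two options could behave inside get_traces(); the _check_frames helper below is hypothetical, not current SpikeInterface API:

def _check_frames(start_frame, end_frame, num_samples):
    # Option 1 (restrictive): validate before slicing
    if start_frame < 0 or end_frame > num_samples:
        raise ValueError(f"frames must be within [0, {num_samples}], got ({start_frame}, {end_frame})")
    if end_frame <= start_frame:
        raise ValueError("end_frame must be strictly greater than start_frame")

# Option 2 (numpy-like): pass the values straight to the slice,
# traces[start_frame:end_frame], which clips out-of-bounds values
# and returns an empty array when start_frame == end_frame.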

h-mayorquin commented 1 year ago

OK, so initially I wanted to follow numpy as much as possible. I had two cases in mind: out-of-bounds slicing, where we should clip, and start_frame == end_frame slicing, where we should return an empty array.


In [1]: x = np.ones(10)

In [2]: x[15:13]
Out[2]: array([], dtype=float64)

In [3]: x[5:3]
Out[3]: array([], dtype=float64)

In [4]: x[10:10]
Out[4]: array([], dtype=float64)

Note that standard Python lists ([]) behave in exactly the same way.
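For completeness, the out-of-bounds clipping case mentioned above looks like this (continuing the same session):

In [5]: x[5:15]
Out[5]: array([1., 1., 1., 1., 1.])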

I had two arguments for this: 1) We already load most of our data as numpy memmaps (so we don't need checks), and we also interact with scipy and numpy functions (which probably expect input with numpy-like semantics, so we should follow their specification). 2) We don't need to come up with a standard ourselves. "Let's just do what numpy does" is easy to follow and avoids the cost of coordinating/agreeing/discussing among ourselves if we disagree about a specific rule. We can then outsource policing to someone else; lower coordination costs for us.

But then I remembered that there is actually a standard in the matter that I think we should follow: https://data-apis.org/array-api/latest/

And this is what it has to say about indexing (quoting from the relevant sections): https://data-apis.org/array-api/latest/API_specification/indexing.html

1) Out of bounds check: This specification does not require bounds checking. The behavior for out-of-bounds integer indices is left unspecified.

Rationale: this is consistent with bounds checking for integer indexing; the behavior of out-of-bounds indices is left unspecified. Implementations may choose to clip (consistent with Python list slicing semantics), raise an exception, return junk values, or some other behavior depending on device requirements and performance considerations.

2) start_frame=end_frame slicing: the specification says the following:

If i equals j, a slice must return an empty array, whose axis (dimension) size along the indexed axis is 0.

I think we should just follow that. We could start second-guessing standards because we think we know better, but I think that would increase the cost of interacting with the ecosystem, and increase the cost of discussing among ourselves if a non-standard rule where we disagree comes up in the future.

So my reading of the standard, and my current position, is that it is fine to throw an error for a start or end frame out of bounds (as this is unspecified behavior), but that we should return an empty array with the channel dimension if start_frame == end_frame. Plus, if any other disagreement comes up in the future, we should defer to the array standard.
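In code, this position would behave roughly as follows (an illustrative sketch, assuming the usual (num_samples, num_channels) layout returned by get_traces(); not the actual implementation):

def get_traces_sketch(traces, start_frame, end_frame):
    num_samples = traces.shape[0]
    # out of bounds: unspecified by the standard, so we are free to raise
    if not (0 <= start_frame <= num_samples and 0 <= end_frame <= num_samples):
        raise ValueError("start_frame/end_frame out of bounds")
    # start_frame == end_frame: empty array that keeps the channel
    # dimension, i.e. shape (0, num_channels), as the standard requires
    return traces[start_frame:end_frame, :]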

JoeZiminski commented 1 year ago

This is a great post and an important question with no clear best answer. I think the decision depends in part on the broader aims of spikeinterface and its intended audience. On one extreme, there is a highly restrictive API which does not allow any indexing operation that could be interpreted as a mistake, i.e. indexing the recording out of bounds, or with start_frame >= end_frame. On the other extreme is a very flexible API with few checks, which could allow unintended operations to pass silently.

The benefits of a more restrictive API are: 1) easier to track errors and detect bugs, 2) enforces very explicit handling of all use cases, 3) easier for non-expert users to detect mistakes in their code. The downsides are that it increases the initial overhead when writing code, breaks with existing data-slicing APIs, and is less flexible.

At present, I lean towards a highly restrictive API that throws an error if any potentially unintended indexing operation is performed. Although this might lead to some initial overhead to explicitly handle all indexing cases, I think it will pay dividends down the line with increased ease of bug-fixing and a reduced number of user-reported issues.

Also, I think that SI should aim to be the de-facto tool for in vivo electrophysiology analysis in Python, inevitably drawing novice Python users with its easy-to-use API. In practice, those analysing electrophysiological data often have little computational training. Having robust checks on common indexing mistakes will make SI more robust and save a lot of headaches for new users.

Fundamentally, I think the problem with allowing such indexing operations is that it is impossible to tell whether the indexing was done on purpose or is a mistake or oversight. The options are then to allow both intentional and unintentional indexing of this sort (a flexible API), or to allow neither (a restrictive API). I think it is generally safer to handle the intended cases at a higher level than get_traces(), and catch all unintended cases, rather than let both intended and unintended cases pass; see the sketch below.
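For example, an intended "until the end of the recording" request could be clamped at the call site before reaching a restrictive get_traces(). get_num_samples() and get_traces() are existing BaseRecording methods; the surrounding variables are only a sketch:

# clamp the requested range at the call site instead of relying on
# silent clipping inside get_traces()
start_frame, requested_end_frame = 1_000, 10**12  # requested range, possibly past the end
num_samples = recording.get_num_samples(segment_index=0)
end_frame = min(requested_end_frame, num_samples)
traces = recording.get_traces(start_frame=start_frame, end_frame=end_frame, segment_index=0)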

h-mayorquin commented 1 year ago

Tagging @CodyCBakerPhD and @pauladkisson, as we usually model roiextractors after the conventions in spikeinterface.

pauladkisson commented 1 year ago

Restrictive indexing makes sense to me -- I just wrote some input validation for the MultiPlaneScanImage (see PR) to prevent out-of-bounds channel and plane indexing. Otherwise, when the specified channel/plane number exceeded the actual number of channels/planes, some other frame was returned (a silent error) due to the way that ScanImage stores its data (sequentially, in a round-robin format).
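To illustrate the silent error (a toy model of round-robin storage, not the actual ScanImage reader code): because pages for all channels are stored interleaved, an out-of-bounds channel index lands on a perfectly valid page belonging to a different frame:

n_channels = 2  # pages stored as: frame0/ch0, frame0/ch1, frame1/ch0, ...
page_index = lambda frame, channel: frame * n_channels + channel

page_index(0, 2)  # channel 2 does not exist, yet this returns page 2,
                  # which is really frame 1 / channel 0: a silent error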

Seems like restrictive indexing is the much safer way to approach this.

samuelgarcia commented 1 year ago

My feeling is: we should go stricter.

In short raise if:

Mainly because:

In short: less work, less code and less bugs. I am lazy. It is a known fact.

h-mayorquin commented 1 year ago

The point that @pauladkisson brings to the table moves me in the direction of more restrictive, and away from my initial position. We deal with multiple segments in SI by default, and I can see how running past the end of the first segment and getting no errors would be even more confusing in this context.

h-mayorquin commented 11 months ago

So, it seems that everybody wants to throw errors for out-of-bounds frames (including negative ones). I was the only one who was against that, but I have been convinced by you guys and by the standard leaving that behavior unspecified. Let's say we give this another week, and if we don't have any strong arguments by 2023/10/06 we run with the restrictive version. That is, we will raise errors if:

What is less clear to me is what we should do for the start_frame == end_frame case, where the standard urges us to return an empty array. Or the more general end_frame > start_frame requirement, where the standard is not clear at all in my view. I myself am leaning towards following the standard, but Alessio included this point under the restrictive approach and most people voted for that. So we should probably move forward with the restrictive view here as well. That is, we will move forward with throwing assertions if:

Somebody else that is not me should put a date on that one, as this is too much of a hard pill to swallow : )

DradeAW commented 11 months ago

@h-mayorquin You're not alone!

I know my comment probably won't change anything, since almost everybody prefers throwing errors, but I still wanted to give my opinion (but I respect the fact that the majority wins).

I also prefer doing it the "numpy way", where giving end_frame == start_frame returns an empty array. I think at some point I used the fact that it didn't give an error to use "out-of-bounds" values. I think it can be useful (although yes, it can hide some errors), and since Python and numpy do it this way, it makes more sense.

JoeZiminski commented 11 months ago

These are very good points, and the current discussion on documentation reminds me again how important it is to check what the other main packages are doing and follow them as closely as possible. So any divergence should be made with very good reason.

In this case, I think there is a reason to deviate from the slicing behaviour of Python / numpy. The purpose of these tools is to provide general-purpose array structures that support a very broad number of use cases. I can see how, in the generalised concept of an array, trying to index end_frame == start_frame makes perfect sense and should return an empty array: the question "What are the array elements between index 1:1" receives the answer "There are no array values between the indices 1:1".

However, in the context of spikeinterface, the concept is not a generalised array but an extracellular ephys timeseries. In this case, it does not make sense to index from and to the same timepoint, especially for get_traces(), which is commonly used after preprocessing. For example, I feel the question "Get me the pre-processed data between time 1s and 1s" should receive the answer "Why are you trying to do that? There are clearly no timepoints between the two times you have provided" (where of course timepoints are expressed as indices in the API, but what we are really representing is timepoints).

h-mayorquin commented 2 months ago

One thing that we did not discuss is whether we follow the numpy convention of accepting slicing with None:

In [1]: np.arange(10)[None:]
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: np.arange(10)[None:None]
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: np.arange(10)[:None]
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

As a proponent above of "doing what numpy does" I think we should, but given that the slicing-police side won above, maybe you also want to throw an error here? @JoeZiminski @samuelgarcia @alejoe91

I will add this to the next maintenance discussion topics so we can agree on this, write some guidelines, and maybe close this issue.

This comes from the discussion in #3159 where the solution depends on the decision.

JoeZiminski commented 2 months ago

Good point @h-mayorquin. For this, I think that if the user has explicitly passed None it's okay to default to numpy-like behaviour. I think this is what is done by default for BaseRecording.get_traces(), as None is the default argument there?
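A sketch of how None could be normalized to numpy-like defaults before any restrictive checks run (illustrative helper, not the actual implementation):

def _resolve_frames(start_frame, end_frame, num_samples):
    # None means "full range", matching the default arguments of get_traces()
    start_frame = 0 if start_frame is None else start_frame
    end_frame = num_samples if end_frame is None else end_frame
    return start_frame, end_frame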

h-mayorquin commented 2 months ago

It is a good point that taking the opposite stance would break backwards compatibility. I prefer to comply with the array standard as much as we can, and this does:

https://data-apis.org/array-api/latest/API_specification/indexing.html#api-specification-indexing--page-root

Probably everybody agrees with this one.