SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License
517 stars 186 forks source link

Clarification for `get_traces`'s `channel_ids` #3049

Closed b-grimaud closed 3 months ago

b-grimaud commented 4 months ago

When calling get_traces, I instinctively (and perhaps wrongly) listed channel_ids as a list of int. Which results in :

ValueError: <id> is not in list

Going through the trace I saw that ids_to_indices looks for the selected channels in _main_ids, but when checking that array I noticed that the declared dtype :

array(['1', '2', '3', ..., '4094', '4095', '4096'], dtype='<U64')

doesn't match that of the individual values :

type(recording._main_ids[0])
numpy.str_

I checked for the file types I had on hand (BioCAM's .brw and MaxWell's .h5) and both had similar file types in their channel arrays.

I know that, at least for Biocam files, the neo header parsing function returns an array of int, and I couldn't find exactly where the type changes.

Either way, that explains the error and passing channel_ids as a list of str works fine.

However, while looking through the issues to see if this was brought up previously, I came across #2320 where it is stated :

A single string is also an iterable but channel_ids can't be strings.

Which leads me to the following questions :

Is it possible to clarify, in the docstrings and maybe the type hints, which type should the iterable contain ? And would it be useful to assert that type or convert it if one is passed instead of the other ?

Apologies if this is somehow specific to the file types I've tried, in which case I'll go through neo again.

Running SI 0.101.0rc0 from source, Python 3.12.2, Ubuntu 24.04.

zm711 commented 4 months ago

Salut Baptiste!

So we have a few different things coming from Neo: channel_names, channel_ids, and channel_indices. Names can be anything so that's why we don't typically use them (some ephys readers let people apply custom names which means we can't guarantee uniqueness). The channel_ids are the hardcoded names (that the hardware supplies). Sometimes these are the same as the indices if the hardware doesn't give us a unique identifier (and so we just use the indices as the ids). At the spikeinterface level they can be strings or ints, but since neo returns the ids as strings (I think always... @samuelgarcia , @alejoe91 , @h-mayorquin ) for a neo extractor you would give them as strings.

However, while looking through the issues to see if this was brought up previously, I came across https://github.com/SpikeInterface/spikeinterface/pull/2320 where it is stated :

A single string is also an iterable but channel_ids can't be strings.

Sorry there was context to that you are missing. What this really means was that the type of channel_ids needs to be a list or array and that channel_ids itself can't be a string. An individual channel_id can be a string.

channel_ids = 'channel 1' # this is bad for the argument
channel_ids = ['channel 1'] # this is good for the argument

The root issue is that some functions allow the user to provide an individual item rather than a list of items (even if that list contains one thing). So we were emphasizing that the argument needed a container rather than the item itself.

Now back to your question.

Going through the trace I saw that ids_to_indices looks for the selected channels in _main_ids, but when checking that array I noticed that the declared dtype :


array(['1', '2', '3', ..., '4094', '4095', '4096'], dtype='<U64')
doesn't match that of the individual values :

type(recording._mainids[0]) numpy.str



A dtype of `'<U64'` means string. So I'm not sure what you mean by this. The ids are an array of strings. Does that make sense? 

As far as adding the type info I think that is fair. But the issue is they can currently be either string or int. (Here is the debate #1432 in case you're interested). 

I've written an essay so let me know if something isn't clear.
h-mayorquin commented 4 months ago

To avoid this type of confusiong you can just call recording.get_channel_ids() and you will see the type, shape and form of the ids.

I think the first error can be improved though to tell what are the ids so you would not have had to go through all the trace. Let me work on that.

I think that is good that @zm711 points out #1432 as it will avoid this type of errors if we always had stings (altought @b-grimaud mentioned that his natura instinct was ints so that's an argument to the other side).

h-mayorquin commented 4 months ago

I opened a PR #3052 which hopefully will make someone else avoid the hops that you had to go through to find this out. Now if you repeat your workflow above a more helpful message will be displayed:

image

b-grimaud commented 4 months ago

@zm711 Thanks a lot for the clarification !

What this really means was that the type of channel_ids needs to be a list or array and that channel_ids itself can't be a string. An individual channel_id can be a string.

I see, I think I misunderstood channel_ids in this context as several individual channel_id instead of the argument as a whole. So if I understand correctly it has to be an iterable of strings, except strings themselves can be iterables of strings hence the confusion.

The ids are an array of strings. Does that make sense?

I went back into the numpy dtype docs and things cleared up on that point, I think I wrongly understood the array type as uint64.

@h-mayorquin

Thanks for the PR, it does make things a lot clearer in my opinion !

(altought @b-grimaud mentioned that his natura instinct was ints so that's an argument to the other side).

To be fair, I think this is mostly because the data I work with uses ints for indexing so I've always assimilated ids and indices as equivalent. The example of an error trace you posted shows that someone used to a different system might not have that same first thought.

h-mayorquin commented 4 months ago

Yes, extractors contain arrays under the hood and we need to make a distinction between indices that are used to access content (array_of_data[indices]) and channel_ids which are format defined identifiers to refer to the data.

Glad you think the PR will help other people in the future. We aim to save users time!

zm711 commented 4 months ago

This is another example of why we should create a glossary of the library (as @JoeZiminski has suggested). We have this in the development docs, but the end-user likely doesn't read those.