[Discussion] Improve the way we handle signals of different nature when they come in the same stream

h-mayorquin commented 1 month ago

In neo, the concept of a stream indicates that the underlying data has the same:

dtype
shape
sampling_rate

That makes a lot of sense in neo raw io because a stream can be thought of as a block for IO and processing while loading and accessing data. It achieves its purpose there.
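As a minimal sketch of this grouping rule (not neo's actual implementation; `ChannelInfo`, `group_into_streams` and the channel values are hypothetical and only for illustration), channels that share a dtype and sampling rate end up in the same stream:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelInfo:
    # Hypothetical per-channel metadata, not neo's real header layout.
    name: str
    dtype: str
    sampling_rate: float


def group_into_streams(channels):
    """Group channels that share dtype and sampling rate into one stream."""
    streams = defaultdict(list)
    for channel in channels:
        streams[(channel.dtype, channel.sampling_rate)].append(channel.name)
    return dict(streams)


# Made-up channels: two wideband, one filtered, one low-rate channel.
channels = [
    ChannelInfo("WB01", "int16", 40000.0),
    ChannelInfo("WB02", "int16", 40000.0),
    ChannelInfo("SPKC01", "int16", 40000.0),
    ChannelInfo("FP01", "int16", 1000.0),
]
streams = group_into_streams(channels)
for key, names in streams.items():
    print(key, names)
```

Note that under this rule the wideband and filtered channels land in the same stream, because nothing in the grouping key describes the nature of the signal.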

Because it is readily available, we also use the concept of stream in spikeinterface to load data. The reason is that a shared shape, sampling rate and dtype are what a buffer of data needs in order to be loaded as a recording. The purposes match at that level.

Where the purposes don't match is in how recording extractor objects are used. Let's take a common use, sorting, and two cases to illustrate.

plexon: currently both wide band and filtered signals are in the same stream. See here: #3196
CED from Cambridge Electronic Design: for our test example in gin raw, LFP, mechanical and even laser recordings are stored in the same stream. See this issue on neo.

I hope that these two situations illustrate that loading a recording and then trying to run a sorter, or most analyses of the data, will not make sense in those cases. The data needs further separation so it can be fed to a typical spikeinterface pipeline.

This is a discussion issue that introduces context so we can make a decision:

Can this be improved on the spikeinterface side?
Should we improve the concept of stream or come up with a new concept that ensures that only electrophysiological data of the same nature is loaded when users access data through our extractors API?
Are those cases above rare enough that we can just document those and not do anything else?

This is also important for us in neuroconv because we usually write a whole stream as an ElectricalSeries, which does not work in these cases, so we need further refinements to achieve our purpose as curators of data.

Tagging @CodyCBakerPhD and @bendichter from the neuroconv side.

zm711 commented 1 month ago

I would argue for Plexon we should split the wideband and the filtered into different streams. Although strictly they could qualify as the same stream, for me it makes more sense to be different streams since it is the exact same data, but filtered vs unfiltered. CED I don't know at all.

By splitting plexon into two streams the user can easily select to use wideband and filter in spikeinterface/the sorter or choose to use the plexon filtered data. Curious what others think.

samuelgarcia commented 1 month ago

Good analysis of the situation. Even the spikeglx last channel could be handled as a separate stream.

h-mayorquin commented 1 month ago

Agreed Solution

The agreed solution is to improve the concept of a stream, keep the name as it is, and fix it fundamentally in neo raw io.

Context

It is unclear what a "logical stream" is, but people have come to have expectations about it because it is exposed in our API. To make this more precise, let's build a provisional characterization of what a "logical stream" is.

A logical stream should:

Be a buffer stream in the IO sense: it should have dtype, sampling_frequency, and shape so it can be thought of as an IO block.
An analysis pipeline such as spike sorting should make analytical sense when loaded in spikeinterface or neo.

A logical stream should not:

Have channels that have different units.
Have channels that have different filtering.

We can add or remove points from this characterization as we move forward.
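The provisional characterization above can be sketched as a simple partition. This is only an illustration under assumptions: the `Channel` fields `units` and `filtering`, the function names and the example channels are hypothetical and do not correspond to existing neo or spikeinterface APIs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    # Hypothetical metadata; the field names are assumptions for this sketch.
    name: str
    units: str
    filtering: str


def is_logical_stream(channels):
    """True when all channels share the same units and the same filtering."""
    return len({(ch.units, ch.filtering) for ch in channels}) <= 1


def split_into_logical_streams(buffer_stream_channels):
    """Partition one buffer stream so each part has uniform units and filtering."""
    logical = defaultdict(list)
    for ch in buffer_stream_channels:
        logical[(ch.units, ch.filtering)].append(ch.name)
    return dict(logical)


# Made-up buffer stream mixing raw, LFP and laser channels.
buffer_stream = [
    Channel("raw-1", "uV", "wideband"),
    Channel("raw-2", "uV", "wideband"),
    Channel("lfp-1", "uV", "lowpass"),
    Channel("laser", "V", "none"),
]
logical_streams = split_into_logical_streams(buffer_stream)
print(is_logical_stream(buffer_stream))
print(logical_streams)
```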

How to Implement It

The current concept of a stream in neo raw io will be renamed (provisionally "buffer stream") and hidden from the user. In other words, it will become an internal implementation detail of neo. In practice, this probably means expanding the neo header struct with another field that will characterize whether the signal can be loaded as one buffer and then sub-dividing this into logical or sub-streams that are exposed to the users. Note that in some cases, this will imply a small inefficiency as more data than is needed will be memmapped, but this can be mitigated and has a low cost overall. This is a lot of work that will happen at the neo level, so the rest of the details should probably be fleshed out there.
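A rough sketch of that idea follows, using plain Python lists in place of a memmapped buffer. Everything here is an assumption for illustration: the `buffer_id` field, the header layout and `get_stream_chunk` are hypothetical and are not neo's header or API.

```python
# Hypothetical header rows: (channel name, buffer_id, column in buffer, stream_id).
# "buffer_id" stands in for the proposed extra header field.
signal_channels = [
    ("ch0", "buf0", 0, "wideband"),
    ("ch1", "buf0", 1, "wideband"),
    ("ch2", "buf0", 2, "filtered"),
    ("ch3", "buf0", 3, "filtered"),
]

# One buffer on disk: rows are samples, columns follow the header order.
buffers = {
    "buf0": [
        [10, 11, 1, 2],
        [12, 13, 3, 4],
    ]
}


def get_stream_chunk(stream_id):
    """Return the samples of one logical stream by reading its whole buffer."""
    rows = [row for row in signal_channels if row[3] == stream_id]
    if not rows:
        raise ValueError(f"unknown stream_id: {stream_id!r}")
    buffer_id = rows[0][1]
    columns = [column for (_, _, column, _) in rows]
    full = buffers[buffer_id]  # the whole buffer is touched, even unused columns
    return [[sample[c] for c in columns] for sample in full]


filtered = get_stream_chunk("filtered")
print(filtered)
```

The user only ever sees the `stream_id` values; the buffer is an internal detail, and the cost is that the full buffer is accessed even when only half of its columns are returned.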

Some Things This Does Not Cover

There are cases like intan where the current stream is not narrow enough, but the "logical streams" cannot be determined from the format alone. For example, the channels of different ports might come from different probes, and the criterion of "you have to be able to run an analysis pipeline on the stream" does not apply neatly. In that case, it will be the responsibility of the user to partition the stream appropriately. I don't see what else can be done there.

h-mayorquin commented 1 week ago

Here is another case of non-logical streams that came to us in neuroconv: https://github.com/catalystneuro/neuroconv/issues/1023#issuecomment-2307504307

zm711 commented 1 day ago

There are cases like intan where the current stream is not narrow enough, but the "logical streams" cannot be determined from the format alone. For example, the channels of different ports might come from different probes, and the criterion of "you have to be able to run an analysis pipeline on the stream" does not apply neatly.

I think in the intan port case we know they come from different ports. I think the right thing to do in this case is to only do one port at a time. Because it is one port per headstage which would mean one port per probe. So someone could easily sort their data separately and as far as I know it would be better to preprocess those ports separately. So in this case I think we would want to eventually hive that off in Intan since logically the ports are like doing separate experiments and so ought to be treated as separate streams.

h-mayorquin commented 1 day ago

@zm711 To add more context, I just worked with an experimental setup where the arrangement was the following (quoting):

You’ll see three tabs, as the rig can accommodate up to three Utah arrays: (1) the first array uses Ports A, B, and C (32 sites each), (2) the second uses Ports D, E, F (32 sites each), and (3) the third, if present, uses Ports G (32 sites) and H (64 sites).

For spikeinterface purposes, I would rather give users too much data (that is memmapped anyway) so they can slice, rather than too little, which would force them to open two recordings and then concatenate.

Would you still maintain your view in the light of this case?
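To make the slicing option concrete, here is a small sketch of the quoted rig layout. The port-to-array mapping and site counts come from the quote; the channel id format `"<port>-<index>"` and the helper `channel_ids_for_array` are assumptions made for this example.

```python
# Port-to-array layout taken from the quoted rig description.
ports_per_array = {
    "array1": ["A", "B", "C"],
    "array2": ["D", "E", "F"],
    "array3": ["G", "H"],
}
sites_per_port = {"A": 32, "B": 32, "C": 32, "D": 32, "E": 32, "F": 32, "G": 32, "H": 64}

# Assumed channel naming of the form "<port>-<index>", e.g. "A-000".
all_channel_ids = [
    f"{port}-{index:03d}"
    for port, n_sites in sites_per_port.items()
    for index in range(n_sites)
]


def channel_ids_for_array(array_name):
    """Select the channel ids of one array from the single combined stream."""
    ports = set(ports_per_array[array_name])
    return [cid for cid in all_channel_ids if cid.split("-")[0] in ports]


array1_ids = channel_ids_for_array("array1")
array3_ids = channel_ids_for_array("array3")
print(len(all_channel_ids), len(array1_ids), len(array3_ids))
```

With all amplifier channels in one stream, each array is a single selection of channel ids. If every port were its own stream, the same array would require opening several recordings and joining them along the channel axis.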

zm711 commented 1 day ago

This is a tough one. Let me think on this case. Utah arrays are their own special case. Often the electrodes 1) lack a rigid geometry and 2) are spaced far enough apart that they should be treated as small sets of electrodes rather than as one overall probe. I guess I don't care enough to fight if you feel strongly about keeping this together and having the user slice rather than have us slice and make the user concatenate/append. I guess I'm slightly in favor of your approach since I don't think we necessarily have a channel_append machinery that would make this easy whereas we have a channel_slice machinery. So from a spikeinterface side it is much easier to slice in this situation.

Thanks for providing this example. I would be curious how the Utah setup actually works, but I do remember you asking me about ports in general ages ago and bringing up this setup as an example of multiport use. Intan does provide SPI cable splitters/adapters which would allow them to merge multiple headstages into one port if you really wanted them to go into one port only, which I think would be the better thing to do to fit with our schema, but end-users can do whatever they want. So your solution reduces the decisions that we make vs the end-user makes.

I'm fine with keeping amplifier all coming out regardless of port. I think writing this has pushed me more toward your camp.

SpikeInterface / spikeinterface