NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
324 stars 248 forks source link

neo.rawio : API enhance proposal buffer_id and stream_id #1543

Open samuelgarcia opened 2 months ago

samuelgarcia commented 2 months ago

Here a proposal for small internal enhencement for rawio layer.

For getting analogsignal using rawio at the moment we have a parse_hedaer() that discover "signal channels" and how they are grouped in distinct "signal stream".

The notion of stream was bit ambiguous. Normally it is a group of signal channel that have the same: "sampling rate" + "size" + "dtype". In short, very often acquired by the same sub device but the notion was left to the class. Very often a stream used to be a set of channel in the same binary buffer of hdf5 buffer. Sometimes channels are in the same buffer but not in the same stream for instance:

  1. In openephys binary format when we have 72 channels : 64 are ephys channels and 8 are AUX channels.
  2. In spikeGLX the last channel (SYNC) is in the same buffer be can be removed using a flag. We should route this channel to a secondary stream instead.

Here a proposal in add a "signal_buffer" key in the rawio.header with a unique *id and then the "signal_stream" and "signal_channel" will have stream_id

Plan :

@h-mayorquin @zm711 @alejoe91

zm711 commented 2 months ago

I think the discussion point would be whether buffer_id is exposed at all or whether we only allow users to choose their stream. For example if I have old code where I expect my openephys binary to return all 72 channels and then I slice it later what would I do with the new stream_id mechanism (ie would I have to rewrite my code or would I have access to say just give me the full buffer)?

I would prefer to only have the stream exposed and make it clear that although we load as a buffer the user only accesses one stream at a time.

Some other ios that have this are: intan (for header-attached everything is one buffer, but we separate out the streams ourselves already) neuronexus (it is basically set up like openephys where you have x channels + 6 reserved channels all in the same buffer) plexon (where the wideband vs the filtered data is in the same buffer currently) plexon2 (as above)

So to make a tl;dr @samuelgarcia do you imagine:

reader.parse_header()
ana_chunk = reader.get_analogsignal_chunk(buffer_id=0, stream_id='1')

or

ana_chunk = reader.get_analogsignal_chunk(stream_id='1')

and we have a dictionary that maps all possible streams to their buffers?

Or to get verbose: do we every have the same stream go across multiple buffers or can we always determine the buffer based on the stream. In my experience it is 1 or more streams/buffer but not streams spread across buffers (unless we count some formats where you have one file/channel which is technically a different buffer but I don't think I would want to treat as a different buffer).

h-mayorquin commented 2 months ago

This is great.

In my view, buffer_stream or buffer_id should be an internal implementation detail that should NOT be exposed to the users. A very important point is that with this view the API will not change, we will just create an internal separation between buffer (is the data such that it can be memmaped from a single file) and stream id (does the signal logically makes sense) . More info here: https://github.com/SpikeInterface/spikeinterface/issues/3197

So it would be the second option of @zm711 above that does not change the API.

zm711 commented 2 months ago

I think Heberto and I are in complete agreement in this. Just want to make sure Sam and Alessio are also cool with buffer being completely internal, which would break returning the whole buffer if someone wanted it, though.

samuelgarcia commented 2 months ago

thanks for adding the link. and yes this is internal to the class. yes yes for sure the API will not be changed and so the stream_index will be the entry point for getting signals. this can be a usefull info for low level stuff.

this buffer_id story is to make the readers class more generic when the buffer api description will be done. At some point a signal_stream could be (without changing the API) a slice of a signal_buffer. That would make a super generic get_analosignal_chunks for many many class.

samuelgarcia commented 2 months ago

See #1544

samuelgarcia commented 2 months ago

Or to get verbose: do we every have the same stream go across multiple buffers or can we always determine the buffer based on the stream. In my experience it is 1 or more streams/buffer but not streams spread across buffers (unless we count some formats where you have one file/channel which is technically a different buffer but I don't think I would want to treat as a different buffer).

I am reading this last important stuff which is more complicated that expected (I am descovering it only now). What to do with files that split channels across files ? Technically it is differents buffer maybe we should describe this correctly. But then it is not what we have in mind : "a buffer contains one or more streams". (which was a very easy picture!) The buffer and stream concept then would be not nested but indepedant. and so the dtype of stream would not contain buffer_id. only channel would have buffer_id and stream_id I do not like this idea but maybe this would be more correct for "channel splited format". Maybe a call would be good at this point.

zm711 commented 2 months ago

Two examples off the top of my head are:

openephys legacy: https://gin.g-node.org/NeuralEnsemble/ephy_testing_data/src/master/openephys/OpenEphys_SampleData_1

intan one file per channel format: https://gin.g-node.org/NeuralEnsemble/ephy_testing_data/src/master/intan/intan_fpc_rhs_test_240329_091637