NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
325 stars 248 forks source link

Plexon2RawIO's implementation of _get_analogsignal_chunks can be improved #1569

Open nikhilchandra opened 2 months ago

nikhilchandra commented 2 months ago

Is your feature request related to a problem? Please describe. The latest version of the PL2FileReader API includes a function capable of retrieving subsets of data from an analog channel. Switching _get_analogsignal_chunks to this function should (hopefully) remove the need to load entire channels into self._analogsignal_cache (this takes up a LOT of memory) and might also speed things up.

Describe the solution you'd like I want to do the following:

Describe alternatives you've considered There are none that I can think of.

Related Question What is the proper way to handle requests that exceed the limits of a recording's segments? For example, suppose I have a multi-trial recording in which 2 recording segments run from [0, 10] and [50, 60] (units in seconds). Suppose the user requests data from [5, 15] seconds -- obviously we have data from [5, 10] from the first segment. But for the interval [10, 15] -- do we just return zeros?

zm711 commented 2 months ago

Sounds cool to me. I'm sure @h-mayorquin is also interested.

h-mayorquin commented 2 months ago

That sounds great. Concerning the question of what to do when asking for out of bound data. That is not clear to me, is this something that you have made a made a decision about @zm711 ?

samuelgarcia commented 2 months ago

Hi. On spikeinterface side we decided to raise an error. On neo side this have never been clear. I think the most often case is to clip not to zero pad.

zm711 commented 2 months ago

I'm looking right now and a bunch of the memmap ones just do slicing [start:stop, :] so that would just follow whatever the numpy memmap convention allows, no. Where is the clipping machinery?

nikhilchandra commented 1 month ago

@samuelgarcia Please, could you clarify what you mean by "clipping"?

samuelgarcia commented 1 month ago

Hi sorry just reading this. I am a bit fuzzy. cliping was for : if stop in [start:stop, :] si greater than the shape[0] then stop is is "clip" to shape[0]. In short very often neo behave like numpy for upper boudary. but spikeinterface do not, it is more strick.

nikhilchandra commented 1 month ago

@h-mayorquin It looks like plexon2rawio.py downloads the relevant DLLs from this location:

https://raw.githubusercontent.com/Neuralensemble/pypl2/master/bin/{file_name}

where "file_name" can take on the value PL2FileReader.dll or PL2FileReader64.dll. How does one go about submitting updated DLLs to this location?

nikhilchandra commented 1 month ago

Another problem -- the functions _segment_t_start and _segment_t_stop take block_index and seg_index as arguments. However, for PL2 files the precise start/stop times for a given segment depend on the stream. For example, here are segment start times for 3 wideband and 3 field potential channels from the same recording.

WB001 segment timestamps: [ 67 3609805 7217229 10824175 14430721 18036311 21642619] WB002 segment timestamps: [ 67 3609805 7217229 10824175 14430721 18036311 21642619] WB003 segment timestamps: [ 67 3609805 7217229 10824175 14430721 18036311 21642619]

FP001 segment timestamps: [ 67 3609827 7217267 10824187 14430747 18036347 21642627] FP002 segment timestamps: [ 67 3609827 7217267 10824187 14430747 18036347 21642627] FP003 segment timestamps: [ 67 3609827 7217267 10824187 14430747 18036347 21642627]

Note that within streams the values are identical, but across streams they differ slightly. I am not sure what the correct approach is to incorporate these observations into the existing framework.

Note: all values are in units of ticks, not seconds.

zm711 commented 1 month ago

This is because the data is organized in packets?

How could the time stamps of the Wideband and field potential be different if it is the same data just filtered differently? Unless the headstage or equipment applies a slight delay to do the onboard filtering and then sends that stream with a slight delay?

nikhilchandra commented 1 month ago

WB and FP sampling rates are 40 kHz and 1 kHz, respectively. Each recorded sample is assigned an integer value (tick) that corresponds to a clock running at 40 kHz. Therefore, successive samples recorded for WB have tick values that increment by 1, while successive samples recorded for FP have tick values that increment by 40. Each segment timestamp in my previous message equals the tick value of the first sample recorded for that segment. Here are some predictions based on the preceding statements:

(1) For a given segment, the first recorded samples for WB vs FP channels should have tick values that differ by a maximum of 40 (2) For FP, tick values for all recorded samples must differ by a multiple of 40.

The attached calculations from an Excel spreadsheet show that both predictions work out.

image

In the row "WB-FP", we see that the difference in tick values for FP and WB segments never exceeds 40.

In the row "FP(n) - FP(0) % 40", we see that the difference in tick values for each FP segment compared with the first (67) is always a multiple of 40 .

I suspect that the correct course of action depends on exactly how _segment_t_start and _segment_t_end are used, will keep digging and make recommendations based on what I find.

nikhilchandra commented 1 month ago

Could someone please provide insight as to how to incorporate new 32- and 64-bit DLLs into the project (see my previous message from last week)? Thank you! @zm711 @h-mayorquin

zm711 commented 1 month ago

Either @h-mayorquin or @JuliaSprenger set this up originally. So to be honest I don't know. Hopefully Heberto can comment on how he set it up. I only helped with the shell commands for it :)

h-mayorquin commented 1 month ago

HI, the DLLs home is here:

https://github.com/NeuralEnsemble/pypl2

I guess the route would be to make a PR to this but I don't know the ownership situation of that repo

@samuelgarcia @apdavison?

Maybe I or @nikhilchandra could take ownership over it now that Julia is sadly not as present as she used to be.

zm711 commented 1 month ago

Just checked and I definitely don't have any writing permission there. We likely need to get Julia to share with a couple people so we don't lose access to it !

apdavison commented 1 month ago

I've given the Neo maintainers plus @h-mayorquin access to the pypl2 repo

h-mayorquin commented 1 month ago

@nikhilchandra feel free to make a PR there.

zm711 commented 1 month ago

@samuelgarcia and @apdavison could you comment on the timing question as well.

nikhilchandra commented 1 month ago

@h-mayorquin workin' on it :)

zm711 commented 1 month ago

@samuelgarcia comment if I'm misquoting you please, but I asked about your other question Nikhil and the idea was the t_start is just the minimum of the segment since there can be staggered start times for the exact reasons you mention but the actual segment starts whenever the first stream of data starts. Again Sam explained this to me in the meeting, but now I'm writing this after a few days, so it might not be completely clear.

nikhilchandra commented 1 month ago

@zm711 @samuelgarcia After digging through the code some more my understanding is that _segment_t_start and _segment_t_end are stream-independent, while _get_signal_t_start is for streams.

The pypl2lib library includes a stream-independent function called pl2_get_start_stop_channel_data() -- this function is independent of stream, so I have predicated _segment_start and _segment_t_end on that. Meanwhile, _get_signal_t_start does depend on stream, and the relevant information can be gathered from the "fragment" variables returned by pypl2lib's get_analog_channel_data().

zm711 commented 1 month ago

Yes the idea is that a segment can contain multiple streams of data so the overall start of the segment is stream independent because it is whenever the first stream starts within that segment. The signals are the streams and so those rely on the actual stream being assessed.

h-mayorquin commented 1 month ago

Any place where we could add the documentation of the meaning of these methods? I remember being confused about it as well at some point.

zm711 commented 1 month ago

You want to tack your comment to #1508. That is the reminder to make our rawio docs better :)

nikhilchandra commented 4 weeks ago

There is some nuance to this problem that I did not fully appreciate earlier. I need to decide what to do -- some input would be appreciated @zm711 @h-mayorquin @samuelgarcia @alejoe91

The OmniPlex system supports trial-based recordings, which means you can pause/resume recordings as needed over the course of an experiment. While SpikeInterface deals with "segments," the PL2 format uses "fragments." Unfortunately, the concept of PL2 fragments does not map perfectly to SpikeInterface segments, as I will show below.

When you start, stop, pause, or resume a recording, this creates a timestamped event. Information about such events can be retrieved with 2 functions in the PL2FileReader API -- PL2_GetStartStopChannelInfo and PL2_GetStartStopChannelData. Note -- these events are not tied to any stream.

When retrieving analog channel data (what SpikeInterface would refer to as a "stream"), one has the option of using the API's function PL2_GetAnalogChannelData. This returns 5 variables:

  1. num_fragments_returned - the total number of fragments returned
  2. num_data_points_returned - the total number of data points returned (across all fragments)
  3. fragment_timestamps - array containing first integer timestamps for each fragment
  4. fragment_counts - array containing number of data points for each fragment
  5. values - array containing data for all fragments, stored contiguously

For a recording with multiple segments, the function returns all data points for all fragments in a single, contiguous array. We need to use fragment_timestamps and fragment_counts to make sense of things.

HOWEVER

Suppose we conduct a recording in one continuous block, i.e., we do not pause/resume the recording at any time. This means that # of segments == 1. Such a recording can still have multiple fragments if the OmniPlex hardware gets overloaded so there are gaps in the data that actually gets recorded. In effect, there is a mismatch between # of segments and # of fragments. The gap need not be very big. I am currently playing around with a file in which the following values are returned for a supposedly continuous recording:

num_fragments_returned = 2 fragment_timestamps = [61, 8200645] fragment_counts[8196286, 41286384]

The first fragment occupies the tick-interval [61, 8196347) The second fragment occupies the tick-interval [8200645, 49487029)

So the gap between the first and second fragments is 4,298 samples. Given the 40 kHz sampling rate, this translates to roughly 107 ms of missing data.

PROPOSED SOLUTION

I think we should use the information from PL2_GetStartStopChannelInfo and PL2_GetStartStopChannelData as the ground truth -- if this information indicates that there is more than 1 segment, then reject the file for processing.

At the same time, if the number of fragments returned exceeds 1 due to one or more gaps in the data, we should do the following:

  1. Issue a warning
  2. Apply a threshold to determine whether the gap is acceptible. a. If the gap exceeds the threshold, break b. If the gap is less than the threshold, just fill it in with zeros.

Questions:

  1. If we do go with my proposed solution, then what would be an acceptable gap threshold? 1 second? 500 ms?
  2. Would we want to add this threshold as a parameter that can be set by the user somewhere?
zm711 commented 4 weeks ago

I will read your comment soon. Today I'm swamped but I will try for the next couple days :)

h-mayorquin commented 4 weeks ago

Same here, this is on my radar but this week is super busy.

samuelgarcia commented 3 weeks ago

@nikhilchandra I totally agree with, the segments concept in reader should be revised it fuuzy used for handling gaps (like in neurlynx) and handling wanted pause during recording.

I would propose a call to discuss that with @zm711 @h-mayorquin and maybe @alejoe91.

I think we should have a clear gap handling API directly in neo.rawio and the user could decide the tolerance of the gaps and optionaly could retrieve the timestamps of samples. This is a massive task.

nikhilchandra commented 3 weeks ago

@samuelgarcia Agreed

@zm711 @h-mayorquin @alejoe91 Let me know if you'd like to do a call sometime

nikhilchandra commented 2 weeks ago

@samuelgarcia I am not sure if I should just proceed with my changes or if some discussion is necessary first, based on preceding comments. Please advise.

h-mayorquin commented 2 weeks ago

Sorry for the delay on this. We all have been very busy. We will discuss this this week, would you like to join? I will send an invite by email.

nikhilchandra commented 2 weeks ago

I have responded to your email. See you soon!

h-mayorquin commented 1 week ago

OK, we discussed this today. Sam proposed a general way of thinking about this but I am leaving here the relevant context to move this forward:

Semantic distinction between segments and gaps: Segments are international gaps are not.

In the ideal case we have an automatic way of telling this. Intentional jumps in time become segments non-intentional ones becomes gaps. Gaps should not be segment and we should create an API to collect them.

Nikhil action items:

I will open another issue on python-neo to discuss the API thing.