NeuralEnsemble / python-neo

Neo is a package for representing electrophysiology data in Python, together with support for reading a wide range of neurophysiology file formats
http://neo.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

[NeuralynxIO] Update .ncs file reading #819

Open TRuikes opened 4 years ago

TRuikes commented 4 years ago

Hi! While working with NeuralynxIO on different datasets I came across a few bugs/missing features, which I think will be useful (necessary) additions:

The header issue will be easy to solve, but the delay and buffer errors are a bit tougher: I think you would have to restructure how .ncs files are read. Currently the first file is read and used to determine gaps and data shape, and this is ignorant of any Dsp delays.

Maybe this workflow would solve it:

I can give this a try for a pull request if you want; the only thing I'm not sure how to solve is how to patch the data, since np.memmap is being used.

JuliaSprenger commented 4 years ago

Hi @TRuikes! Thanks for the very extensive investigation and list of references. Regarding the individual points:

samuelgarcia commented 4 years ago

Hi @TRuikes. Thank you for this long analysis. Feel free to make proposal with PRs.

Julia has better knowledge than me about the format specification.

About the inverted input: do you mean it is a bug, or a feature? Or would you like a kwarg to control it, e.g. `compensate_inverted=True/False`?
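Such a kwarg could look like this (a sketch only; `apply_inversion` and both parameter names are hypothetical, not existing neo API):

```python
import numpy as np

def apply_inversion(raw_chunk, input_inverted, compensate_inverted=True):
    # input_inverted would come from the InputInverted flag in the .ncs
    # header; compensate_inverted is the proposed user-facing kwarg
    sig = np.asarray(raw_chunk, dtype='float64')
    if compensate_inverted and input_inverted:
        sig = -sig
    return sig
```

The default would decide whether neo silently undoes the hardware inversion or returns the raw samples as stored.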

About the regexp for datetime: we can have `datetime=None` when it fails, I agree.
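A minimal sketch of that fallback (the function name and the exact header date format are assumptions; Cheetah versions vary in how they write the header):

```python
import re
import datetime

def parse_neuralynx_datetime(header_line):
    # return None instead of raising when the header line does not
    # match the expected "YYYY/MM/DD HH:MM:SS" pattern
    m = re.search(r'(\d{4})/(\d{2})/(\d{2})\s+(\d{2}):(\d{2}):(\d{2})',
                  header_line)
    if m is None:
        return None
    return datetime.datetime(*map(int, m.groups()))
```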

About DspDelayCompensation and DspFilterDelay, I don't know; I'll let Julia answer.

About gaps, I am OK with a kwarg that controls the behavior when there is a gap, like `gap_mode='multi_seg' / 'fill_nan'`. It is a lot of work. Keep in mind that this is a neorawio, so it has to be lazy: no data loading when parsing the header, every load is on the fly. This is why we need to keep the memmap system intact.
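The `'multi_seg'` side of that kwarg can stay lazy, since gap detection only needs the record timestamps, not the samples. A sketch (function name and tolerance are hypothetical; 512 samples per record per the .ncs format):

```python
import numpy as np

def split_by_gaps(timestamps_us, sample_rate, samples_per_record=512, tol=0.5):
    # expected spacing between consecutive record timestamps, in µs
    expected = samples_per_record * 1e6 / sample_rate
    deltas = np.diff(np.asarray(timestamps_us, dtype='i8'))
    # a delta more than tol extra record-durations long starts a new segment
    gap_starts = np.nonzero(deltas > expected * (1 + tol))[0] + 1
    # return record indices per segment; slicing the memmap with these
    # index arrays keeps all sample loading on the fly
    return np.split(np.arange(len(timestamps_us)), gap_starts)
```

Under `'fill_nan'` the same boundaries would instead be used to compute how many filler samples to synthesize at read time.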

samuelgarcia commented 4 years ago

Answering at the same time. I think we can make both behaviors possible, but it leads to more difficult code.

TRuikes commented 4 years ago

Thanks for the responses. I'll upload examples when I can.

The Dsp issue will become clearer with the data. Shortly summarized: if there is no delay compensation but there is filtering, Cheetah filters the raw data and saves the datapoints at the timestamp of recording, instead of compensating for the lag introduced by the filter (i.e. saving the data at timestamp + filter lag).
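In that case a correction would amount to shifting each affected channel earlier by its filter delay. A sketch under those assumptions (function name is hypothetical; zero-padding at the end is one choice among several):

```python
import numpy as np

def compensate_dsp_delay(samples, delay_us, sample_rate):
    # convert DspFilterDelay_µs from the header into a whole-sample shift
    shift = int(round(delay_us * 1e-6 * sample_rate))
    if shift == 0:
        return np.asarray(samples)
    out = np.empty_like(samples)
    # move the trace earlier by `shift` samples so it lines up with
    # channels that had DspDelayCompensation enabled
    out[:-shift] = samples[shift:]
    out[-shift:] = 0  # pad the tail; zeros rather than NaNs
    return out
```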

For me the main problem with solving it with segments is that you would end up with 6 segments for one .ncs file and 4 for another. This breaks the API we built around neo, because we assume one segment (dealing with a variable number of segments of different lengths doesn't seem great), and I think this is not the behaviour you would like either. I wouldn't fill the gaps with NaNs either; I'd prefer zeros or the signal mean. I thought I'd start off by pointing out where we might need to fill in the data.

I haven't thought much further than this yet, but current behaviour can be extended with:

This way you'd only have to run the patch functions once per dataset; after that the data should be compatible with the current IO.

I'm not sure if broken data like this is a recurring thing, and whether it's worth the effort to deal with it properly in neo. I think it would be useful, and perhaps a selling point (right now I could also patch the data with FieldTrip in Matlab and then start using neo).

sin-mike commented 4 years ago

Hi @TRuikes, I've been working with Neuralynx for about 5 years now, in both Matlab and Python. I'm not a maintainer, but I've studied the neo interface thoroughly and I can make some clarifications.

A: Inverted input is a reasonable feature when working with positive/negative fields, like reversals. So having it set by default is being agnostic to the experimenter's setup.

  • NCS files within one session can have different data lengths, causing this to break. When `DspFilterDelay_µs > 0` and `DspDelayCompensation == Enabled`, Cheetah seems to pad the data (I'm not 100% confident about how). For a tetrode, one could have 3 channels saved with `DspFilterDelay == 0` while one is being bandpassed (`DspFilterDelay > 0`), yielding patches before and after the signal if Dsp delay compensation is on.
  • If `DspDelayCompensation == Disabled` and `DspFilterDelay > 0`, this is not dealt with, and one ends up with improperly aligned channels.

A: Non-uniform data alignment is a feature specific to Cheetah; few other ephys systems allow it. In our lab we advise against recording with configs that allow it, or with non-uniform sampling rates (a feature we've heard of, but don't use either). Cases of non-uniform alignment need manual processing, as it depends on how exactly you want to fill the gaps. The default FieldTrip approach of filling with NaNs can be extremely RAM-hungry: say you only have 15 secs recorded out of each minute, with the other 45 secs paused, for 100 trials or so in one record. Filling would increase the RAM needed for your record 4 times. Imagine that with 100 channels at 32 kHz; my PC would just give up. The gap-to-segment strategy is pretty straightforward and consistent with other recording systems that allow pausing.
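The back-of-the-envelope numbers behind that 4x figure (channel count, rate, and float64 dtype are illustrative assumptions):

```python
# NaN-filling forces float arrays over the full wall-clock duration,
# not just the recorded stretches
n_channels, rate, itemsize = 100, 32_000, 8  # 100 ch, 32 kHz, float64

recorded_s = 15.0   # actually recorded per minute
total_s = 60.0      # wall-clock span once gaps are filled

bytes_real = n_channels * rate * recorded_s * itemsize
bytes_filled = n_channels * rate * total_s * itemsize
# ~0.38 GB of real data balloons to ~1.5 GB per minute of wall-clock time
```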

A: I've partially made this happen. When I ran into this problem with lost samples, I found reconstructing the signal from raw timestamps quite ambiguous. The thing is that you're always reading from handy memmaps and then reshaping the blocks into a continuous trace. Doing this with interpolation or masked traces would be time- and memory-consuming, and since buffering errors are quite rare, it is easier to process them as individual segments and then concatenate them on the user side, as I did eventually.

sin-mike commented 4 years ago
  • input_inverted: I would see it as a feature, it's nice functionality for sure. But what happened now is that the 'researcher' told me he recorded his data with 'inverted' on; I then used this IO to load the data and generate a .dat file for klusta (for spike sorting), and in the parameters I asked klusta to look for negative peaks. For me a line in the docstring would be enough, so I would be aware of this. If this is the default behaviour of all IOs in neo, you could also mention it somewhere in the documentation.

A: Klusta works pretty well with this parameter ON by default; that's why it is there. You shouldn't have to worry about whether an experimenter configured inverted input or not. In addition, SpyKING CIRCUS uses this parameter the same way when working with .ncs files, and it works well: https://github.com/spyking-circus/spyking-circus/blob/8efc51b357ec04cb924bcdb5601eeda8a6bfcb31/circus/files/neuralynx.py#L250

TRuikes commented 4 years ago

Hi @sin-mike, good that you joined the discussion. I'm fairly inexperienced, I'd say, so it's good to have more input from experimentalists. I'll comment below on your remarks; after this I'll continue with the PR: #820. I also uploaded examples of my breaking data. Reading back, I realized I didn't mention that the issues above cause the IO to break for me. Let me recap the 3 remaining issues:

Then all of these issues can be solved simultaneously by implementing a flow along these lines:

TRuikes commented 4 years ago

@sin-mike if you do this:

you have only 15 secs recorded of a minute with other 45 secs for a pause for 100 trials or so in one record

it is recorded in the events file, right? Then you can define your segments by recording starts and stops. @JuliaSprenger, is there a specific reason why you check for gaps in the timestamps to define segments? Why not use the recording start and recording stop events to define segments?
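Something like this, pairing start/stop markers from the events file (a sketch; the function name is hypothetical, and the exact label strings Cheetah writes may differ between versions):

```python
def segments_from_events(event_times, event_labels,
                         start_label='Starting Recording',
                         stop_label='Stopping Recording'):
    # pair each recording-start event with the next recording-stop event;
    # each (start, stop) pair would become one segment boundary
    starts = [t for t, lbl in zip(event_times, event_labels)
              if lbl == start_label]
    stops = [t for t, lbl in zip(event_times, event_labels)
             if lbl == stop_label]
    return list(zip(starts, stops))
```

This would make segment boundaries explicit and independent of timestamp-gap heuristics, though it relies on the events file being present and complete.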