SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License

Noise estimation #1681

Open florian6973 opened 1 year ago

florian6973 commented 1 year ago

Hi!

I am using spikeinterface to do spike sorting, and I noticed in recent papers like https://iopscience.iop.org/article/10.1088/1741-2552/ac8077/meta that the RMS can be used as an efficient estimate of the noise. Therefore, I have a few questions related to the current get_noise_levels function:

  1. Do you think that spikeinterface may add the RMS as an option for noise estimation in the future?
  2. Some papers like https://authors.library.caltech.edu/13699/1/QUInc04.pdf mention using the MAD to estimate the standard deviation and therefore the noise, but I have not found any discussion of the effect of taking random chunks, as spikeinterface does. I understand that it is useful for time and memory; however, I am puzzled by the way it is done: I would naturally have thought of random sampling of the timeseries, but if I am not mistaken, some samples can be selected in different chunks (the chunks can overlap). Do you happen to know the mathematical concepts behind this choice?

Thanks in advance for your help!

Best,

florian6973

h-mayorquin commented 1 year ago

Hi, @florian6973, what do you mean by RMS? Real mean square? (The paper is behind a paywall, so you would need to quote the relevant sections.) The most common estimator for the std was added recently by @DradeAW, if I remember correctly:

get_noise_levels:

https://github.com/SpikeInterface/spikeinterface/blob/a9b4c34200be91a589ad145c976c5a177ce6746e/src/spikeinterface/core/recording_tools.py#L82C5-L100

As you can see, it uses numpy's std, which implements the classical estimator of the standard deviation (commonly used to characterize noise):

https://numpy.org/doc/stable/reference/generated/numpy.std.html

This is what you mean by RMS probably?

Concerning your question about statistics, there is nothing special about what spikeinterface does. You are taking samples to estimate a value. Using the whole time series would just mean using all the samples available, which could be challenging if they do not fit in memory, as you mentioned; hence chunking. Note that there is nothing temporal about this: you are assuming that the samples are independent to estimate noise, which in fact assumes that the series is stationary.

Some samples are used twice, as you mentioned, but not that often. The relevant concept here is sampling with and without replacement. In my view, given the assumptions that both of those methods make (the noise is normal) and the sample sizes we have by default, this should not matter much, or at all. Just check the sizes of the confidence intervals for these estimators as a function of N and you will see they become small fairly quickly.
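As a quick sanity check of that claim, here is a small numpy sketch (assuming i.i.d. Gaussian noise with unit std; the sample sizes are illustrative, not spikeinterface defaults) showing how the spread of the std estimator shrinks as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Spread of the std estimator across 200 repeated draws, for growing sample
# sizes N. For i.i.d. Gaussian noise the spread shrinks roughly like 1/sqrt(2N).
for n in [100, 1_000, 10_000]:
    estimates = [np.std(rng.normal(0.0, 1.0, size=n)) for _ in range(200)]
    print(f"N={n}: spread of std estimates = {np.std(estimates):.4f}")
```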

florian6973 commented 1 year ago

Hi h-mayorquin! Thank you so much for your detailed reply!

Sorry about the paywall, I did not check that. By RMS I mean root mean square: [image: the paper's RMS definition]. Regarding some of the results they obtained: [image: results table], where NEO stands for non-linear energy operator, ABF for Ada-BandFlt, and ABS for absolute value (a preprocessing step). With their comments: [image: excerpt of the authors' comments].

And of course the RMS is equivalent to the std when the signal mean is 0, which is the case with a bandpass filtered signal. Awesome if it has been added recently, I will wait for the release of the new version with it.
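A minimal numpy check of that equivalence (assuming a zero-mean trace, as after bandpass filtering):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 5.0, size=100_000)
x -= x.mean()  # emulate a bandpass-filtered trace (zero mean)

rms = np.sqrt(np.mean(x**2))  # root mean square
std = np.std(x)               # standard deviation
print(rms, std)               # identical for a zero-mean signal
```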

Thanks for summarizing the main assumptions: stationary signal, normal noise... Indeed, the confidence interval for the std is quite easy (e.g. https://dl.icdst.org/pdfs/files3/22a131fac452ed75639ed5b0680761ac.pdf, page 122), whereas for the MAD it is a bit more complex (for instance https://arxiv.org/pdf/1910.00229.pdf). In any case, what puzzles me is the way the samples are selected: randomly choosing a start index and taking the next chunk_size samples. If I am not wrong, this does not exactly correspond to the usual with- or without-replacement definitions, so we cannot apply the traditional formulas...
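To make the scheme concrete, here is a simplified numpy sketch of chunk-based selection (an illustration of the idea under discussion, not the actual get_random_data_chunks code). It shows that independently drawn chunks can overlap, so individual samples are selected in a way that is neither classic sampling with nor without replacement:

```python
import numpy as np

def random_chunk_indices(num_samples, chunk_size, num_chunks, seed=0):
    """Draw chunk start positions independently at random. Chunks may
    overlap, so a given sample can appear in more than one chunk."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, num_samples - chunk_size, size=num_chunks)
    return np.concatenate([np.arange(s, s + chunk_size) for s in starts])

# 5 chunks of 10_000 samples drawn from 30_000 samples: overlap is guaranteed
idx = random_chunk_indices(num_samples=30_000, chunk_size=10_000, num_chunks=5)
print(len(idx), len(np.unique(idx)))  # total indices vs distinct samples
```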

h-mayorquin commented 1 year ago

Bear in mind that we use MAD to estimate the standard deviation:

https://en.wikipedia.org/wiki/Median_absolute_deviation

We don't estimate MAD as a measure of scale as in your paper.
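A small numpy sketch of that conversion (assuming Gaussian noise; 0.6745 is the 0.75 quantile of the standard normal distribution), which also shows why the MAD-based estimate is less sensitive to spike contamination than the direct std:

```python
import numpy as np

rng = np.random.default_rng(0)
trace = rng.normal(0.0, 10.0, size=100_000)        # pure noise, std = 10
spike_idx = rng.integers(0, trace.size, size=500)  # 0.5% spike contamination
trace[spike_idx] += 80.0

std_direct = np.std(trace)
# MAD scaled by 1/0.6745 estimates the std under a normality assumption
mad_std = np.median(np.abs(trace - np.median(trace))) / 0.6745

print(std_direct, mad_std)  # direct std is inflated by spikes; MAD barely moves
```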

As for the other issue that is puzzling you, I don't understand your concern. In the end, all the formulas mentioned here sum over the samples to get the estimators. To me, that sum means the collection method does not matter. Taking chunks the way we do only means that some samples will sometimes be represented twice (with replacement).

That said, statistics is hard and I might be confused or missing something. Let me know what you think. It would be nice to do some experiments, but unfortunately I don't have time for that right now.

cwindolf commented 1 year ago

Hi @h-mayorquin , I'm not sure that's strictly true about MAD. get_noise_levels supports both MAD- and standard deviation-based estimates of the noise levels. For recordings which have been highpass or bandpass filtered, the standard deviation estimate coincides with the RMS (up to numerical differences due to the chance that the mean in a chunk could differ from 0 due to noise). And, in recordings which consist purely of Gaussian noise, the MAD-based estimate and the RMS will also coincide, although MAD helps in the presence of spikes.

@florian6973 I think there are two points of interest that you raise:

Finally, one last thing I recommend is increasing num_chunks_per_segment from the default of 20 to, say, 100 or 200; alternatively, it could be parameterized as a number of chunks per thousand seconds of recording (this would follow more closely what @oliche does, which is to scan the recording at chunks spaced by a uniform interval, like 50 s). @samuelgarcia / @alejoe91 I was actually wondering if a much larger number of chunks could be set as the default in get_random_data_chunks -- I typically do this, for instance, when using preprocessing.zscore, since otherwise I occasionally observe a large variance in the results (across datasets, not runs, since a random seed is set), and others may be unknowingly experiencing the same issue.

h-mayorquin commented 1 year ago

@cwindolf I am not sure what you are referring to when you say that is not strictly true about MAD. I said a lot of things above. But maybe this:

The formula used in the code uses MAD to estimate the standard deviation using the assumption that the data is distributed normally:

https://github.com/SpikeInterface/spikeinterface/blob/a9b4c34200be91a589ad145c976c5a177ce6746e/src/spikeinterface/core/recording_tools.py#L82C5-L100

We estimate the std using two different methods: in one case directly, in the other using the MAD as an intermediate step. But we don't use the MAD itself as a characterization of noise.

Or was it something else?

cwindolf commented 1 year ago

Sorry, I think I just had a misunderstanding about what you had written! Thanks for the clarification.

h-mayorquin commented 1 year ago

Good point about the correlation, though. The estimators above all assume that samples are independent, and that assumption is farther from the truth when sampling by chunks. That makes sense. Until I think about it twice and realize that the "ideal" case would be to use all the samples (as much information as possible), which would have exactly the same problem...

But maybe the ideal case is to sample every x frames to avoid temporal correlation.

In fact, I am glad that you jumped in, because I have been confused about this for a while. I think @DradeAW also mentioned something about requiring many more samples. For someone without your subject matter expertise, this is baffling. Why on earth would you need a lot of samples to estimate an std? Is it because you include peaks and you want to get rid of them? What is your intuition? How do you think about it?

cwindolf commented 1 year ago

In my view, we need lots of samples because of nonstationarity: due to drift or changes in firing patterns, one chunk of the recording can lead to very different noise level estimates from another, so take lots of chunks to average away this effect.

It leads to more questions. For instance, I think the IBL pipeline computes the RMS per channel across many chunks and then takes the median of these, rather than taking the RMS (or MAD/0.675) of the concatenated chunks. This is a way of dealing with nonstationarity: it averages over nice chunks while staying robust to weird ones thanks to the median. You could imagine other strategies too, including not trying to have one global noise level.
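A toy numpy sketch of that median-of-chunk-RMS idea (an illustration of the strategy described, not the actual IBL code; the chunk size and the artifact amplitude are made up for the example):

```python
import numpy as np

def median_of_chunk_rms(traces, chunk_size=10_000):
    """Noise level as the median of per-chunk RMS values: averages over
    nice chunks while staying robust to occasional weird ones."""
    n_chunks = traces.size // chunk_size
    chunks = traces[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)
    return np.median(np.sqrt(np.mean(chunks**2, axis=1)))

rng = np.random.default_rng(0)
traces = rng.normal(0.0, 10.0, size=200_000)  # baseline noise, std = 10
traces[50_000:60_000] *= 5.0                  # one "weird" high-noise stretch

global_rms = np.sqrt(np.mean(traces**2))
robust = median_of_chunk_rms(traces)
print(global_rms, robust)  # the median-of-chunks estimate ignores the bad stretch
```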

(And re: correlated samples, these estimators are still valid when samples are correlated, it's just that using correlated samples leads to less reduction in variance of the estimator compared with using independent samples. Since it's quicker to grab fewer larger chunks than more tiny ones, it may still be a good tradeoff in terms of the variance of the estimator.)

h-mayorquin commented 1 year ago

I see. Thanks. When I last heard stationarity mentioned in a discussion, I imagined the general case where your underlying distribution changes across time (imagine the std, or the third and fourth moments, changing in time), and my thinking was: well, then there is nothing you can do in the general case, since you don't know how it varies.

But your description above sounds more like there is a constant long-term noise behavior that you want to extract, plus intermediate regions or zones that you want to average out. As in: there is long-term noise, and then atypical small periods with idiosyncratic noise. We take a lot of samples so that the atypical, idiosyncratic regions wash out when we estimate the long-term noise.

Or you have drift in the mean of your distribution and you want to compare deviations within small regions, as the long-term spread is more or less constant.

OK, if this is right, I agree with your intuition that many small chunks is the better strategy for both cases. Small chunks don't let the atypical regions coagulate and are less sensitive to drift in the central measure (i.e. the mean). Plus, parameterizing by some intuitive quantity like chunks per second, as you mention, seems like a good idea to make experimentation easier for users.

florian6973 commented 1 year ago

@h-mayorquin Thanks for your reply! I see your point, but in the article they use the MAD to approximate the std too, as shown in their definition: [image: the paper's MAD-based std definition]

What kind of experiments were you thinking about? Simulating some signals with more or less spikes (and more or less variability) and assessing how the noise is approximated using different estimators, sampling methods and parameters?

Regarding statistics, @cwindolf rephrased my thoughts very clearly, thank you very much!

Good to know about the number of chunks to consider. When you say 100 or 200, is that per 1000 s of recording?

Following your ideas, it may be relevant to implement an adaptive noise level per chunk: for instance, for each chunk of 50 s there would be a corresponding noise level used for the computations. However, it may require some work to implement, and it can be quite computationally intensive when you want to retrieve the average over the whole signal.

h-mayorquin commented 1 year ago

I meant this article, which is about the confidence intervals: https://arxiv.org/pdf/1910.00229.pdf

I just wanted to make clear that the authors operationalize noise as the std and build different methods of estimating that value. We are not using the MAD as an operationalization of noise (which seems to me like a valid choice as well!).

For the experiments, I was thinking: if you take any distribution and sample it in the way we do, then the variance of your estimators should become rather small quickly, and more or less in the same fashion for the different methods discussed here.

If you want to build a scenario like the one we were describing above (where the underlying noise distribution has some variation), it is less clear to me. I would just add artificial spikes, or regions where the noise is not sampled from the same distribution. But probably @cwindolf or the others have much better suggestions for this.
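Along those lines, a possible starting point for such an experiment (a toy model with made-up parameters, not a realistic recording): simulate noise contaminated with artificial spikes and look at how the spread of each estimator shrinks with the subsample size.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_trace(num_samples, spike_rate=0.01, noise_std=10.0, spike_amp=60.0):
    """Gaussian noise plus sparse spike-like deflections."""
    x = rng.normal(0.0, noise_std, size=num_samples)
    x[rng.random(num_samples) < spike_rate] += spike_amp
    return x

def mad_std(x):
    """MAD-based std estimate under a normality assumption."""
    return np.median(np.abs(x - np.median(x))) / 0.6745

trace = toy_trace(1_000_000)
# Spread of each estimator across 50 random subsamples of growing size
for n in [1_000, 10_000, 100_000]:
    subs = [rng.choice(trace, size=n, replace=False) for _ in range(50)]
    print(n,
          np.std([np.std(s) for s in subs]),    # spread of direct std
          np.std([mad_std(s) for s in subs]))   # spread of MAD-based std
```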

samuelgarcia commented 1 year ago

The current strategy of get_noise_levels is quite simple. It could be extended to more scenarios, for instance the one you mention (taking one chunk every 50 s); Kilosort, for instance, takes 1 s every 25 s. If the recording is nonstationary, there are many other problems besides the weakness of the noise estimation, so I would keep the default assumption that the noise is stationary. I would be happy to see how fast (in how many buffers) the MAD-based noise estimate converges on a standard dataset. I have the feeling that 1 s every 25 s costs a lot of computing for little difference. Very often this noise level is used with a multiplicative factor to detect peaks, say 5. If we put 4.9 or 5.1, how much difference does that make compared to the 0.1% difference we have in the noise estimation?