LBL-EESA / fastkde

Other
50 stars 10 forks source link

Strange ripples appearing in output #7

Closed MarcelRobitaille closed 2 years ago

MarcelRobitaille commented 2 years ago

I am experiencing an issue where strange ripples are appearing in the output.

Here is what I am talking about: image

density, intensity = fastKDE.pdf(
    signal,
    numPoints=2**12 + 1,
)

Here is an example using scipy.stats.gaussian_kde that behaves more how I would expect: image

kernel = scipy.stats.gaussian_kde(signal, bw_method=0.02)
min_intensity = min(signal)
max_intensity = max(signal)
delta_intensity = max_intensity - min_intensity
intensity = np.linspace(
    min_intensity - delta_intensity,
    max_intensity + delta_intensity,
    1000,
)
density = kernel(intensity)

It's the ripples between the two big peaks that are throwing off my peak-finding. I could believe that these are coming from the noisy points between the two big peaks. However, there are clearly no points far above or far below the big peaks, and yet we see smaller ripples there too. Is this expected behaviour? Is this the price of "fast" KDE? Do you have a suggestion on how to mitigate this? Might this be related to #6?

Thank you

MarcelRobitaille commented 2 years ago

It seems to be related to numPoints, but for other examples, I needed this high value to reduce ripples.

taobrienlbl commented 2 years ago

Hi @MarcelRobitaille - that's interesting. Ripples like this do occasionally happen, but usually only when there is something artificial about the data (like values below a certain threshold are artificially removed prior to calculating the KDE). They're usually caused by ringing associated with representing a sharp edge with a limited number of frequencies.

Your data don't seem to have any of the characteristics that I usually associated with ringing in fastKDE, so I don't immediately have any guesses.

Would you be willing to share a sample of the input data with me so that I can explore this a bit? (If you don't want to post to github, you can e-mail me directly at obrienta@iu.edu)

taobrienlbl commented 2 years ago

@MarcelRobitaille and I corresponded by e-mail over this; it turns out that a pre-processing step was the cause. The data going into fastKDE were pre-processed with a uniform rolling window to reduce some white noise. This is equivalent to convolving the data with a top-hat function, which has sharp edges that induce ringing. Since the Fourier transform of a top-hat function is a sinc function, the peaks in the sinc function are superimposed on the spectral representation of the KDE, which results in the distinct wave-like features in the resulting KDE.

An ad hoc solution is to convolve the data with a gaussian rather than a top-hat function to filter out high-frequency noise. Technically, this violates one of the assumptions underlying fastKDE: independent and identically distributed data. Practically, this doesn't seem to do anything bad in this case, but fastKDE's convergence isn't theoretically guaranteed when the iid assumption is violated. One might consider decorrelating the data (e.g., if you apply a 30-point rolling window filter, subset every 30th point; x[::30]) to avoid this.