Optimal bandwidth dependent on dimensionality?

redhog commented 1 year ago

It seems the output is highly dependent on the dimensionality. That is, if you run a KDE using only the first 2 dimensions, versus the first 3 and then averaging over the last dimension, you get very different density distributions (but the same general patterns).

taobrienlbl commented 8 months ago

Hi @redhog - apologies for taking so long to get back to you. Thanks for raising this. The main difference between the two plots looks to be the amount of high-frequency variability in the 3D version in comparison to the 2D version.

Intuitively, this behavior makes sense to me for complicated PDFs. When doing the KDE, the spectral representation of the empirical characteristic function (ECF) ends up being filtered based on contiguous regions that are above a data-dependent threshold; the filter essentially retains low-frequency variability in the ECF. Regions that aren't contiguous in 2D might end up being contiguous in 3D, meaning it's possible for high-frequency regions of the ECF to appear in the 3D KDE that aren't present in the 2D version.
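To make the contiguity argument concrete, here is a toy sketch. It is not fastKDE's actual filter code. It relies on one standard fact: the characteristic function of a marginal is the joint characteristic function evaluated at zero frequency in the dropped dimension, so the lower-dimensional ECF is a slice of the higher-dimensional one. The mask `above`, the helper `connected_to_origin`, and the axis-aligned flood fill are all assumptions made for illustration. For brevity the example compares 1D with 2D rather than 2D with 3D.

```python
from collections import deque

import numpy as np


def connected_to_origin(mask, origin):
    """Mark the cells of `mask` reachable from `origin` through
    axis-aligned neighbours that are also True (a simple flood fill)."""
    mask = np.asarray(mask, dtype=bool)
    reached = np.zeros_like(mask)
    if not mask[origin]:
        return reached
    reached[origin] = True
    queue = deque([origin])
    while queue:
        cell = queue.popleft()
        for axis in range(mask.ndim):
            for step in (-1, 1):
                nxt = list(cell)
                nxt[axis] += step
                nxt = tuple(nxt)
                inside = all(0 <= i < n for i, n in zip(nxt, mask.shape))
                if inside and mask[nxt] and not reached[nxt]:
                    reached[nxt] = True
                    queue.append(nxt)
    return reached


# Hypothetical "ECF magnitude above threshold" mask on a small frequency grid.
# Rows index the extra frequency t2, columns index t1; row 0 is t2 = 0,
# i.e. the slice that corresponds to the lower-dimensional (marginal) ECF.
above = np.array(
    [
        [1, 1, 0, 1, 1],  # t2 = 0
        [1, 1, 1, 1, 0],  # t2 = 1
        [0, 0, 0, 0, 0],  # t2 = 2
    ],
    dtype=bool,
)

# Frequencies retained when only the t2 = 0 slice is available.
low_d = connected_to_origin(above[0], (0,))

# Frequencies retained when the full grid is available, viewed on the same slice.
high_d = connected_to_origin(above, (0, 0))

print(low_d.astype(int))      # [1 1 0 0 0]
print(high_d[0].astype(int))  # [1 1 0 1 1]
```

In the slice alone, the two high-frequency cells are cut off from the origin and would be filtered out. With the extra dimension present they are reachable through the `t2 = 1` row, so they are retained, which is the mechanism described above.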

The specification of the filter is somewhat arbitrary, as long as it follows some mathematical guidelines defined by Bernacchia and Pigolotti (2011); how best to specify the filter is an open research question. The issue you raise here makes me wonder whether there might be a way to specify the filter such that results are consistent between high- and low-dimensional versions of the PDF.

That said, this is beyond the scope of a simple bug fix, as I think there might be a paper to write on this topic. And I'm not even sure whether it is in fact a bug or a feature.

I'm going to leave this open for now in hopes that someone might be able to provide some insight.

redhog commented 8 months ago

Heyas! Thanks for taking the time to look into this, even if it took such a long time, @taobrienlbl. I actually don't remember exactly what I used this for, or any other details, but I think this is an interesting issue in general, since it might surprise users just like it did me. If nothing else, your explanation is a good start for a note in the documentation.

LBL-EESA / fastkde

Optimal bandwidth dependent on dimensionality? #17