Separate plotting functionality from kde estimation for better interoperability with visualization libraries

k-a-mendoza commented 7 months ago

It would be nice to not have matplotlib so tightly integrated into the plot.py functionality so that we can exersize tighter control over the output graphs. Doing this would also result in faster adoption by alternate plotting libraries like plotly, matplotlib, bokeh, pyvista, etc;.

taobrienlbl commented 7 months ago

thanks for this comment! Do you have specific thoughts on how the interface might change to facilitate this? Also, I'll note that I happily welcome pull requests!

k-a-mendoza commented 7 months ago

Hmm lets see.

I'm less familiar with your specific KDE implementation, but i think the last example can be easily adapted to act as a general numerical tool.

So adapting this example

PDF = fastkde.pdf(x, y, var_names = ['x', 'y'])

would just require either adding a simple keyword argument

pdf_meshgrid = fastkde.pdf(x, y, data_only=True)

or perhaps a separate function:

pdf_meshgrid = fastkde.pdf_array(x,y,**kwargs)

All that is required is that PDF return a scalar mesh field, and potentially the coordinate data if its not the same as the input coordinates. If the coordinates are rectilinear, they can simply be the 1-D arrays for each axis. If they are scattered, then perhaps an intermediate function to regrid the data?

pdf_meshgrid,xx,yy = fastkde.pdf_array(x,y,return_coords=[??]**kwargs)

For reference, see the pcolormesh and contour docs. The calls require the same input data, X, Y, and Z.

Does this seem easy to do?

taobrienlbl commented 7 months ago

Hi @k-a-mendoza , thanks for clarifying. That is in fact what fastKDE does, so fortunately the change I need to make is one of documentation rather than interface/code.

Part of the confusion might originate in a new feature that I introduced in v2.0 that hasn't received much feedback. If the xarray library is present, fastkde.pdf by default will return an xarray.DataArray object (that's the type of PDF in the first code example you quote). xarray is a wrapper around numpy for structured datasets. If xarray is not present, then fastkde.pdf returns a PDF array and a tuple of axis arrays. Note that there is an important additional difference in behavior if an xarray object is returned: in this case fastkde.pdf crops the PDF and axis values to so that they just barely surround the input data values. If an xarray object is not returned, then a much wider range of axis values is returned (this is the set of axis values on which the KDE is actually calculated).

Expanding on this, here is some sample code to extract the arrays you are referring to if xarray is installed:

# if xarray is installed
PDF = fastkde.pdf(x, y, var_names = ['x', 'y'])
pdf_meshgrid = PDF.values
# get the axis arrays
xx = PDF.x
yy = PDF.y

If xarray is not installed, then here would be the behavior:

# if xarray is not installed
pdf_meshgrid, (xx, yy) = fastkde.pdf(x, y)

# if xarray is in fact installed, the above behavior can be forced with the use_xarray=False option
# pdf_meshgrid, (xx, yy) = fastkde.pdf(x, y, use_xarray=False)

Does this address your original comment?

k-a-mendoza commented 7 months ago

You know what, after playing with the package a bit, I don't think this suggestion applies anymore. the returned xarray is already useful enough and expressive enough for the user to make their own plots.

However, there are a few things that I think would be appropriate on a readthedocs or perhaps a follow up publication. As far as I can tell, there are two other popular KDE implementations in the python ecosystem:

Scikit-learn's KernelDensity object.. Used like:KernelDensity.(kernel='gaussian', bandwidth=0.2).fit(X). Offers lots of options for different types of ND kernels and kernel scaling.
Scipy. Used like stats.gaussian_kde(values). Fewer options, but appears to easily click with numpy. It is still far slower than fastkde, but the option of specifying the bandwidth is quite useful. It's also the method that runs under the hood of seaborn, so mimicking it's interface may allow seaborn to flexibly use your package.

One thing that isn't clear from fastkde's readme file is what numerical options are available. I'm not sure if the FFT process under the hood allows for different bandwidths or kernels, but if its easy to implement I think that would be quite useful. Seaborn also has an option to specify contour levels. From the documentation:

"""
levels  : nt or vector
    Number of contour levels or values to draw contours at. A vector argument must have increasing values in [0, 1]. 
    Levels correspond to iso-proportions of the density: e.g., 20% of the probability mass will lie below the 
    contour drawn for 0.2. Only relevant with bivariate data.
"""

Playing around with assinging manual contours from the returned xarray object on some geoscience data, I'm finding that there are edge values in the pdf near zero that are quite large, although the majority of points lie in the 0 - 0.1 range. So perhaps more documentation on how to normalize this to iso-proportions or options to enforce zero amplitude at zero frequencies may be nice.

Actually, now that I think of it, and option to return the FFT transform and perform frequency data augmentation or other signal processing tricks could really help; this would have to be paired with an option to injest the transformed FFT back into data coordinates.

taobrienlbl commented 7 months ago

These are all good comments. For the most part, I think most of your comments seem to point toward the need for better documentation, which is a known deficiency in this package. (It's one of the downsides of producing code as a scientist: it's taken me a long time to get fastKDE to an even remotely professional point, and there's still a lot more to go.) I'd reemphasize that I welcome pull requests large and small if you see easy ways to improve things! I'll go through your comments point-by-point below to indicate why I think they're mainly related to poor documentation.

One thing that isn't clear from fastkde's readme file is what numerical options are available. I'm not sure if the FFT process under the hood allows for different bandwidths or kernels, but if its easy to implement I think that would be quite useful.

Yes, the documentation is unclear on this. There's only one available in fastKDE, and it's the Bernacchia and Pigolloti algorithm, which specifies both the bandwidth and kernel shape based on the data. Bandwidth isn't a controllable parameter in this algorithm.

Seaborn also has an option to specify contour levels. The fastkde.pair_plot() function implements this, but it would be better to allow more flexibility. This should be better documented.

Playing around with assinging manual contours from the returned xarray object on some geoscience data, I'm finding that there are edge values in the pdf near zero that are quite large, although the majority of points lie in the 0 - 0.1 range. So perhaps more documentation on how to normalize this to iso-proportions or options to enforce zero amplitude at zero frequencies may be nice.

This is a challenge that I'm not yet certain how to handle. There are certain distribution types that are challenging for fastKDE: heavy-tailed distributions (these are hard for any KDE scheme) and bounded data (causes spectral ringing). I haven't yet figured out a way to give good, general guidance on how to deal with challenging distributions. I'd love thoughts here.

Actually, now that I think of it, and option to return the FFT transform and perform frequency data augmentation or other signal processing tricks could really help; this would have to be paired with an option to injest the transformed FFT back into data coordinates.

This is actually possible, but it's not a documented feature! The object-oriented interface to fastKDE (fastkde.fastKDE) returns a fastKDE object that stores the fourier transform PDF (fastkde.fastKDE.phiSC) and has the ability to apply the FFT to obtain the resultant PDF (fastkde.fastKDE.__transformphiSC__). These routines are reasonably well-commented in fastkde/src/fastkde/fastKDE.py, but I have not yet had time to produce an RTD that documents all this.

So thanks again for all these comments. And it looks like you're doing some cool research: maybe we'll cross paths at AGU this year! Cheers

k-a-mendoza commented 7 months ago

Ahh man. I would love to go to AGU. Unfortunately with graduating and employment sorta up in the air, im not sure if that will be possible. we'll see.

This is a challenge that I'm not yet certain how to handle. There are certain distribution types that are challenging for fastKDE: heavy-tailed distributions (these are hard for any KDE scheme) and bounded data (causes spectral ringing). I haven't yet figured out a way to give good, general guidance on how to deal with challenging distributions. I'd love thoughts here.

Hm. I wonder how easy it would be to transform the data into a normal-like distribution, do the KDE, then inverse the coordinates when plotting? If it works, then maybe that's worthy of a simple readthedocs blurb rather than a reworking of the internals. The data I'm working with is a little strange in that it is both somewhat linear and very much poissonian (two geophysical images with wildly varying resolution as a function of depth).

As for the bandwidth, maybe its possible to "trick" the FFT into having a specific kind of bandwidth through frequency data augmentation. I'll think on it more.

I'd be happy to contribute to a readthedocs when i finish my defense. I got a number of projects that I want to publish with similar readthedocs formatting, and it would be good to get some experience in that domain.

taobrienlbl commented 7 months ago

Hm. I wonder how easy it would be to transform the data into a normal-like distribution, do the KDE, then inverse the coordinates when plotting? If it works, then maybe that's worthy of a simple readthedocs blurb rather than a reworking of the internals. The data I'm working with is a little strange in that it is both somewhat linear and very much poissonian (two geophysical images with wildly varying resolution as a function of depth).

That's what the log_axes option tries to do in fastkde.pdf...it works reasonably well for powerlaw data with relatively small exponents. But it doesn't work for data with zeros like what you're describing. I did at one point attempt a solution where I fit a spline to the empirical CDF and then used that fit to transform the data to a unit normal distribution...but it didn't seem to work well. But I might also not have implemented it correctly in the ~half-day I spent working on it.

If you can send some sample data my way, I might be able to help determine a way to get fastKDE to work with it. (The part of me that's trying to procrastinate on grading is offering that.)

As for the bandwidth, maybe its possible to "trick" the FFT into having a specific kind of bandwidth through frequency data augmentation. I'll think on it more.

That should definitely be possible. The fastkde.fastKDE.kappaSC variable contains the transform of the kernel. If you were to overwrite this with the transform of a gaussian kernel of a given bandwidth (e.g., determine the transform analytically and then calculate its values at the frequency points), that would do what you're describing. It'd be an FFT-based KDE, but it would no longer be 'fastKDE'

I'd be happy to contribute to a readthedocs when i finish my defense. I got a number of projects that I want to publish with similar readthedocs formatting, and it would be good to get some experience in that domain.

If you get the chance, I'd definitely appreciate it. Until then, best of luck on finishing up your dissertation and determining what comes next!

LBL-EESA / fastkde

Separate plotting functionality from kde estimation for better interoperability with visualization libraries #37