LBL-EESA / fastkde

Other
50 stars 10 forks source link

Applying density to scatterplot #33

Closed Cota123 closed 5 months ago

Cota123 commented 6 months ago

Hi, First of all, I want to thank you for provinding us this wonderful package. I was trying to create a scatterplot that shows the density by the means the color of plots. The following code that I have written does works but it's too slow, especially when the datasize is large. So I was wondering how to do the same thing using fastkde. I really appreciate if you would help me with this.

import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

# sample data
x = np.random.normal(20, 2, 10000)
y = np.random.normal(20, 4, 10000)
xy = np.vstack([x, y])

# KDE
z = gaussian_kde(xy)(xy)

# scatterplots colored based on density 
plt.figure(figsize=(4, 3))

plt.scatter(x, y, c=z, s=5, marker='o', cmap='jet', alpha=1)

# x,y labels
plt.xlabel('x')
plt.ylabel('y')

plt.show()

image

taobrienlbl commented 6 months ago

Hi @Cota123 - I'm glad to hear of your interest in fastKDE for this case.

Yes, it is technically possible (an example is given below) and fastKDE yields a higher-quality estimate of the probabilities than scipy.stats.gaussian_kde, but this isn't a use-case that fastKDE was originally designed for. As a result, fastKDE is in fact slower than the scipy code. This is because the aspect of fastKDE that makes it 'fast' is the use of a non-uniform Fourier transform for performing the KDE in spectral and the use of an FFT for transforming the KDE estimate back to data space. For the case of estimating the KDE at specific points--and not on a uniform grid--the FFT cannot be used. Rather the implementation in fastKDE uses a direct Fourier transform, which is notoriously slow. (Also, the cython code could be optimized further to improve the speed.) The correct way to approach this would be to implement a non-uniform inverse Fourier transform. While this is technically possible, it would involve quite a lot of work: more than I have available.

So it may be that fastKDE isn't appropriate for your use case. I'll note though, that I would welcome a pull-request that improves the performance of the fastkde.pdf_at_points() function. It might be possible to optimize the cython code substantially more than I have, or it might be easy to write a short numba routine that takes advange of threads or GPUs.

But in case you do want to try, here is an example that builds on yours:

import fastkde

x = np.random.normal(20, 2, 10000)
y = np.random.normal(20, 4, 10000)
pdf = fastkde.pdf_at_points(x, y)

# scatterplots colored based on density 
plt.figure(figsize=(4, 3))

plt.scatter(x, y, c=pdf, s=5, marker='o', cmap='jet', alpha=1)

# x,y labels
plt.xlabel('x')
plt.ylabel('y')

plt.show()

image

Cota123 commented 5 months ago

Thank you @taobrienlbl so much for your reply. I got the point that what I intended to do was not what fastkde was desgined for. Also thank you for kindly letting me know that you welcome a pull-request, although it's alright for me for now.