KDE over two distributions while avoiding querying individual datapoints.

GeorgWa commented 1 month ago

Great library!

I want to get the KDE across two distributions within the same limits and with the same bandwidth. The safest way would be to query individual datapoints, which might be very slow.

So far I would get the KDE estimates separately and perform some interpolation. However, I think this will result in different bandwidths. Do you have any suggestions for how to do this efficiently while making sure score_decoy == score_target?

(density_decoy, score_decoy) = fastkde.pdf(
    psm_df[psm_df[decoy_column] == 1][score_column].values,
    use_xarray=False
)
(density_target, score_target) = fastkde.pdf(
    psm_df[psm_df[decoy_column] == 0][score_column].values,
    use_xarray=False
)
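
For illustration, the interpolation workaround described above could look roughly like the sketch below. It assumes each call returned a 1D density with its own evenly spaced axis; the helper name resample_to_common_grid and the synthetic inputs are hypothetical and not part of fastkde. Note that this only aligns the evaluation points: each density was still estimated with its own kernel, so the bandwidths can still differ.

```python
import numpy as np

def resample_to_common_grid(score_a, density_a, score_b, density_b, num_points=513):
    """Linearly interpolate two 1D densities onto one shared, evenly spaced grid."""
    lo = min(score_a[0], score_b[0])
    hi = max(score_a[-1], score_b[-1])
    common = np.linspace(lo, hi, num_points)
    # Outside the range where a density was originally evaluated, treat it as zero.
    a = np.interp(common, score_a, density_a, left=0.0, right=0.0)
    b = np.interp(common, score_b, density_b, left=0.0, right=0.0)
    return common, a, b

# Synthetic stand-ins for the two (density, axis) pairs; not real fastkde output.
grid_a = np.linspace(-5.0, 5.0, 129)
grid_b = np.linspace(-2.0, 8.0, 257)
pdf_a = np.exp(-0.5 * grid_a**2) / np.sqrt(2 * np.pi)
pdf_b = np.exp(-0.5 * (grid_b - 3.0) ** 2) / np.sqrt(2 * np.pi)

score, density_a, density_b = resample_to_common_grid(grid_a, pdf_a, grid_b, pdf_b)
```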

taobrienlbl commented 1 month ago

Thank you @GeorgWa!

Just to make sure I understand, it sounds like you're asking two separate--but related--questions:

1. How can one run fastkde.pdf() on two different datasets and have the PDF output at the same points for both datasets?
2. How can one run fastkde on two datasets, using the same kernel for both?

Is this correct?

The first is very straightforward, and I'll show an example below. The second is technically possible, but it would involve writing a minor amount of custom code using the fastkde.fastKDE object-oriented interface. If this is indeed what you're wanting to do, could you please expand a bit on what you're attempting and why you would want to use the same bandwidth for two separate PDF calculations? Knowing this would help make sure I give the right advice.

Regarding the first question, to ensure that both calls to fastkde.pdf() evaluate the PDF at the same points, you can simply add the axes= argument to the second call to specify the evenly-spaced points at which to calculate the PDF:

(density_decoy, score_decoy) = fastkde.pdf(
    psm_df[psm_df[decoy_column] == 1][score_column].values,
    use_xarray=False
)
(density_target, score_target) = fastkde.pdf(
    psm_df[psm_df[decoy_column] == 0][score_column].values,
    use_xarray=False,
    axes = score_decoy,
)

GeorgWa commented 1 month ago

Hi,

Yes, that's right!

What I want to do is use a KDE to compare the local density of two classes in a classification problem. This is what the histogram of the score looks like for the two classes:

Since it's only 1D, I only need an approximation, and I know the kernel beforehand, I've realised that fastKDE is most likely too sophisticated for this problem :D. I've found that binning followed by FFT-based convolution with a Gaussian kernel is actually sufficient.
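
A minimal sketch of that binning-plus-FFT approach, using only NumPy, might look like the following. The function name binned_gaussian_kde, the grid limits, the bandwidth of 0.3, and the synthetic scores are all assumptions made for illustration; this is not the code actually used here and not part of fastkde. Because both classes share the same bin edges and the same kernel, the two densities come out on identical points with an identical bandwidth.

```python
import numpy as np

def binned_gaussian_kde(samples, grid_edges, bandwidth):
    """Approximate 1D KDE: histogram the samples, then smooth with a Gaussian via FFT."""
    counts, _ = np.histogram(samples, bins=grid_edges)
    bin_width = grid_edges[1] - grid_edges[0]
    centers = 0.5 * (grid_edges[:-1] + grid_edges[1:])

    # Gaussian kernel sampled at the bin spacing, truncated at +/- 4 bandwidths.
    half = int(np.ceil(4 * bandwidth / bin_width))
    offsets = np.arange(-half, half + 1) * bin_width
    kernel = np.exp(-0.5 * (offsets / bandwidth) ** 2)
    kernel /= kernel.sum()

    # Linear convolution via zero-padded FFTs, cropped so it stays aligned with the bins.
    n = len(counts) + len(kernel) - 1
    full = np.fft.irfft(np.fft.rfft(counts, n) * np.fft.rfft(kernel, n), n)
    smoothed = full[half:half + len(counts)]

    # Convert smoothed counts to a density and remove tiny negative round-off values.
    density = np.clip(smoothed, 0.0, None) / (len(samples) * bin_width)
    return centers, density

# Synthetic scores standing in for the decoy and target classes.
rng = np.random.default_rng(0)
decoy_scores = rng.normal(-1.0, 1.0, 5000)
target_scores = rng.normal(2.0, 1.0, 5000)

edges = np.linspace(-8.0, 10.0, 513)  # shared limits for both classes
score_decoy, density_decoy = binned_gaussian_kde(decoy_scores, edges, bandwidth=0.3)
score_target, density_target = binned_gaussian_kde(target_scores, edges, bandwidth=0.3)
```

The accuracy of this approximation depends on the bin width being small relative to the bandwidth, and density near the grid limits is underestimated because part of the kernel mass falls outside the grid.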

Nevertheless, thanks a lot! I'm sure I will find a use case in the future.

LBL-EESA / fastkde

KDE over two distributions while avoiding querying individual datapoints. #39