JuliaStats / KernelDensity.jl

Kernel density estimators for Julia

Multivariate implementation #110

Open andreasKroepelin opened 1 year ago

andreasKroepelin commented 1 year ago

This PR ports the previously existing bivariate implementation to a multivariate one, so that KernelDensity.jl now supports arbitrary dimensions. The bivariate estimation is now just a special case of the more general implementation. Note that this is backwards compatible: a BivariateKDE has the same properties as before, and the multivariate implementation introduces no runtime overhead compared to the previous bivariate-specific one.
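
For illustration, a minimal sketch of how the generalised interface can be called for three-dimensional data, assuming the tuple-of-coordinate-vectors form that is exercised later in this thread (x, y, z are placeholder names):

using KernelDensity

# one coordinate vector per dimension, 10_000 samples each
x, y, z = randn(10_000), randn(10_000), randn(10_000)

# trivariate estimate; with two vectors this reduces to the familiar bivariate case
kd = kde((x, y, z))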

itsdfish commented 1 year ago

Are there any plans to merge this PR?

tpapp commented 1 year ago

It is on my TODO list to review and merge; it looks very nicely done, so I am sure it will get merged.

Apologies for the delay, this repository is community-maintained.

itsdfish commented 1 year ago

No problem. I completely understand.

The reason I ask is that I have a use case for this feature and would like to use it as soon as possible.

I went through the PR mostly from the perspective of a user to see whether I could understand how to use the new feature.

I wanted to test the multivariate KDE on a multivariate normal. Here is my setup:

using KernelDensity
using Interpolations
using Distributions
using LinearAlgebra
using Random 

Random.seed!(50)

# number of dimensions 
n = 2
# vector of means 
μ = zeros(n)
# covariance matrix 
Σ = I(n)
# true distribution 
true_dist = MvNormal(μ, Σ)
# data for the kde
data = rand(true_dist, 10_000)

kde_data = Tuple(data[i,:] for i ∈ 1:size(data,1))
kd = kde(kde_data)
# estimated distribution 
est_dist = InterpKDE(kd)

However, it was unclear to me how to compare the interpolated and true pdfs. Here is what I found:

test_vals = rand(true_dist, 20)
kde_test_vals = Tuple(test_vals[i,:] for i ∈ 1:size(test_vals,1))
# why does this return a 20 X 20 matrix? 
est_pdfs = pdf(est_dist, kde_test_vals...)
# throws an error 
# est_pdfs = pdf(est_dist, test_vals)
true_pdfs = pdf(true_dist, test_vals)

I tried passing the test data in different forms, but none of them returned the expected output.

Another issue I noticed is that kde hangs and nearly freezes my computer when n = 4. I'm not sure whether that is to be expected, but the limitation is not made clear to the user.

One last issue that might be addressed in the README: the expected dimensions of the data matrix should be explained. When the data are sampled from MvNormal, they must be transposed:

kd = kde(data')

so that the rows are samples and the columns are the variables.

If possible, please let me know how to use pdf with multiple inputs.

Let me know if I can help in any way.

Update

It appears that n = 3 is about the practical limit. The following Python code also hangs with four variables, but doesn't nearly freeze my computer. It might be good to note this limitation in the README and docstrings.

import numpy as np
from fastkde import fastKDE

N = 100
var1 = 50 * np.random.normal(0, 1, N) + 0.1
var2 = 50 * np.random.normal(0, 1, N) + 0.1
var3 = 50 * np.random.normal(0, 1, N) + 0.1
var4 = 50 * np.random.normal(0, 1, N) + 0.1

# four-variable KDE; this is the call that hangs
myPDF, axes = fastKDE.pdf(var1, var2, var3, var4)

andreasKroepelin commented 1 year ago

Another issue I noticed is that kde hangs and nearly freezes my computer when n=4.

Note that a KDE of 4-dimensional data must produce a 4-dimensional density array, so it is likely that your computer runs out of memory. You are right that a warning about this could be added to the documentation. I guess it would be possible to have a lot of fun with sparse data structures to make this work in higher dimensions, but as far as I understand, this package is supposed to be a general-purpose "standard" implementation without such fancy things.
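
As a rough back-of-envelope illustration (the 256 grid points per dimension are an assumption for illustration, not necessarily the package default):

# hypothetical grid resolution per dimension
npoints = 256

# size of a dense Float64 density grid (8 bytes per entry), in GiB
[(d, npoints^d * 8 / 2^30) for d in 2:4]
# ≈ 0.0005 GiB for d = 2, 0.125 GiB for d = 3, 32 GiB for d = 4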

One last issue that might be addressed in the README file: the dimensions of the matrix should be explained. When the data are sampled from MvNormal, they must be transposed:

I would claim that this is rather a peculiarity of Distributions.jl. Julia's column-major memory layout for arrays usually suggests representing samples as rows and features as columns. I checked this recently for another purpose, and to my knowledge all implementations of the Tables.jl interface store features contiguously in memory, so it is reasonable that the API of kde assumes the same. Maybe it could be documented better. I believe the reason Distributions.jl's rand behaves differently is that samples are produced one after another, so it is beneficial cache-wise for them to store samples as columns.
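
For concreteness, here are the two equivalent ways of passing the MvNormal samples that already appear in this thread (assuming the matrix and tuple entry points compute the same estimate):

data = rand(true_dist, 10_000)   # 2 × 10_000, samples as columns

# option 1: transpose so that rows are samples and columns are variables
kd = kde(data')

# option 2: pass one coordinate vector per dimension as a tuple
kd = kde(Tuple(data[i, :] for i in 1:size(data, 1)))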

# why does this return a 20 X 20 matrix?

So my implementation just generalises what the bivariate one did before. However, you might be right that the implementation was and is a bit strange. Let's have a look.

Old:

pdf(ik::InterpKDE,xs::AbstractVector,ys::AbstractVector) = [ik.itp(x,y) for x in xs, y in ys]

New:

function pdf(ik::InterpKDE{K, I}, xs::Vararg{AbstractVector, N}) where
    {N, R, K <: MultivariateKDE{N, R}, I}

    [ik.itp(x...) for x in Iterators.product(xs...)]
end

So for x in xs, y in ys becomes for x in Iterators.product(xs...), basically. This product is the reason why you get a 20x20 matrix. I believe the implementation should instead be for xz in zip(xs, ys) (old), or for x_tuple in zip(xs...) (new). Then, you would obtain a vector of length 20 in your case, which is probably what you are expecting, right?
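
As a sketch of that pointwise variant (pointwise_pdf is a hypothetical helper name, not part of the PR or the package):

# evaluate the interpolant only at the points (xs[1][i], xs[2][i], ...),
# returning a vector instead of the full Iterators.product grid
pointwise_pdf(ik::InterpKDE, xs::AbstractVector...) =
    [ik.itp(x...) for x in zip(xs...)]

# with the 2 × 20 test data from above this would return a length-20 vector:
# pointwise_pdf(est_dist, kde_test_vals...)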

Maybe someone more familiar with the original implementation can comment on that? According to git blame, the code is at least seven years old, so there is probably no point in asking the author :sweat_smile:

Edit: I just noted that the same confusion arose in #102.

itsdfish commented 1 year ago

@andreasKroepelin, thank you for your explanation. What you said makes sense to me.

You are right. I was expecting a vector with 20 elements, comparable to what pdf returns for an MvNormal.

It might be worth changing that behavior. In the meantime, can you recommend a workaround?

andreasKroepelin commented 1 year ago

You can basically implement the right behaviour yourself, maybe in combination with what was suggested in #102:

kd = kde(kde_data)
test_vals = rand(true_dist, 20)
est_dist = InterpKDE(kd)
# evaluate the interpolant at each test point (each column of test_vals)
est_pdfs = [est_dist.itp(col...) for col in eachcol(test_vals)]
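
To sanity-check the workaround, one can compare against the true density from the earlier setup (just an illustration; the agreement depends on sample size and bandwidth):

true_pdfs = pdf(true_dist, test_vals)    # length-20 vector from Distributions
maximum(abs.(est_pdfs .- true_pdfs))     # should be small for 10_000 samples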