JuliaDynamics / Associations.jl

Algorithms for quantifying associations, independence testing and causal inference from data.
https://juliadynamics.github.io/Associations.jl/stable/
Other
151 stars 13 forks source link

mutualinfo with KSG1/KSG2's results different from scikit-learn and about 2x slower for large arrays #367

Open liuyxpp opened 5 months ago

liuyxpp commented 5 months ago

In Python + scikit-learn

from sklearn import feature_selection
import numpy as np

X = np.arange(10)
X = X / np.pi
X = X.reshape((10,1))
y =  np.array([0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9])
# By defaut: n_neighbors = 3
feature_selection.mutual_info_regression(X, y)   # = 0.69563492

In Julia + CausalityTools

import CausalityTools: mutualinfo, KSG1, KSG2

X = (0:9) ./ π
y = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9]
mutualinfo(KSG1(k=3), X, y)  # = -0.43223601423458935
mutualinfo(KSG2(k=3), X, y)  # = 0.4746008686099013

The results are clearly different. In addition, in scikit-learn, if mi<0, 0 will be reported.

For running time, I have

In Python

X = np.random.rand(171522,1)
y = np.random.rand(171522,)
tic=time.perf_counter(); feature_selection.mutual_info_regression(X,y); toc=time.perf_counter()
array([0.00072401])
toc - tic  # = 0.96151989325881 seconds

In Julia

mutualinfo(KSG1(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG1(k=3), rand(171522), rand(171522))  # 2.245426 seconds (857.67 k allocations: 83.230 MiB)

mutualinfo(KSG2(k=3), rand(171522), rand(171522))
@time mutualinfo(KSG2(k=3), rand(171522), rand(171522))  # 2.113910 seconds (857.68 k allocations: 91.082 MiB)

So, Julia implementation is much slower than Python implementation.

kahaaga commented 5 months ago

Hey, @liuyxpp! Thanks for the report.

Can you report back the output of Pkg.status() in the same session as you're testing, so I know which versions of packages you're using?

using Random; rng = MersenneTwister(1234)
using CausalityTools, BenchmarkTools
x = rand(rng, 10000)
y = rand(rng, 10000)

mutualinfo(KSG1(), x, y) # run once for precompilation
@btime mutualinfo(KSG1(), $x, $y) # now time it

If the performance gain is also evident after testing properly with @btime, then the difference might be due to implementation differences.

kahaaga commented 5 months ago

@liuyxpp Also: are you running also sort of multithreading for the Python code?

kahaaga commented 5 months ago

It would also be nice to see whether this apparent performance difference is also true for higher dimensional data. In the Ross paper which is linked in the python docs, they use a specialized fast implementation for 1D data. Could you try it out with two large 3D datasets, for example?