JuliaDynamics / CausalityTools.jl

Algorithms for detecting associations, dynamical influences and causal inference from data.
https://juliadynamics.github.io/CausalityTools.jl/dev/

mutualinfo with KSG1/KSG2's results different from scikit-learn and about 2x slower for large arrays #367


liuyxpp commented 2 months ago

In Python + scikit-learn

from sklearn import feature_selection
import numpy as np

X = np.arange(10)
X = X / np.pi
X = X.reshape((10,1))
y =  np.array([0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9])
# By default: n_neighbors = 3
feature_selection.mutual_info_regression(X, y)   # = 0.69563492

In Julia + CausalityTools

import CausalityTools: mutualinfo, KSG1, KSG2

X = (0:9) ./ π
y = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 1.0, 0.9]
mutualinfo(KSG1(k=3), X, y)  # = -0.43223601423458935
mutualinfo(KSG2(k=3), X, y)  # = 0.4746008686099013

The results are clearly different. In addition, scikit-learn clips negative estimates: if mi < 0, it reports 0.
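For what it's worth, that clipping can be mimicked on the Julia side with a small wrapper (a minimal sketch; mi_clipped is a hypothetical helper of mine, not part of either API, and it reuses the X and y defined above):

# Clip negative estimates to zero, as scikit-learn does.
mi_clipped(est, x, y) = max(0.0, mutualinfo(est, x, y))

mi_clipped(KSG1(k=3), X, y)  # = 0.0, since the KSG1 estimate above is negative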

For the running time, I get the following.

In Python

import time  # for perf_counter

X = np.random.rand(171522, 1)
y = np.random.rand(171522,)
tic = time.perf_counter(); feature_selection.mutual_info_regression(X, y); toc = time.perf_counter()
# output: array([0.00072401])
toc - tic  # = 0.96151989325881 seconds

In Julia

mutualinfo(KSG1(k=3), rand(171522), rand(171522))  # first call compiles
@time mutualinfo(KSG1(k=3), rand(171522), rand(171522))  # 2.245426 seconds (857.67 k allocations: 83.230 MiB)

mutualinfo(KSG2(k=3), rand(171522), rand(171522))  # first call compiles
@time mutualinfo(KSG2(k=3), rand(171522), rand(171522))  # 2.113910 seconds (857.68 k allocations: 91.082 MiB)

So the Julia implementation is much slower than the Python implementation.

kahaaga commented 1 month ago

Hey, @liuyxpp! Thanks for the report.

Can you report back the output of Pkg.status() from the same session you're testing in, so I know which package versions you're using?

using Random; rng = MersenneTwister(1234)
using CausalityTools, BenchmarkTools
x = rand(rng, 10000)
y = rand(rng, 10000)

mutualinfo(KSG1(), x, y) # run once to compile
@btime mutualinfo(KSG1(), $x, $y) # now time it

If the performance gap is still evident after benchmarking properly with @btime, then the difference might be due to implementation differences.

kahaaga commented 1 month ago

@liuyxpp Also: are you running some sort of multithreading for the Python code?
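For reference, the relevant thread counts on the Julia side can be checked like this (a minimal sketch; BLAS.get_num_threads comes from the LinearAlgebra stdlib):

using Base.Threads, LinearAlgebra

println("Julia threads: ", nthreads())  # set with `julia -t N` or JULIA_NUM_THREADS
println("BLAS threads:  ", BLAS.get_num_threads())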

kahaaga commented 1 month ago

It would also be nice to see whether this apparent performance difference holds for higher-dimensional data. In the Ross paper linked in the Python docs, they use a specialized fast implementation for 1D data. Could you try it out with two large 3D datasets, for example, as in the sketch below?
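Something along these lines would do (a minimal sketch; it assumes mutualinfo accepts multivariate Dataset inputs with one point per row, and the sample size of 100_000 is arbitrary):

using CausalityTools, BenchmarkTools, Random
rng = MersenneTwister(1234)

# Two 3-dimensional datasets, one point per row.
x = Dataset(rand(rng, 100_000, 3))
y = Dataset(rand(rng, 100_000, 3))

mutualinfo(KSG1(k = 3), x, y)           # run once to compile
@btime mutualinfo(KSG1(k = 3), $x, $y)  # then benchmark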