Small Fix To Make Metric Usable With Both SKLearn's NearestNeighbors & KNNImpute

KacperKubara / distython

Distance metrics which can handle mixed-type data and missing values


Hi Kacper!

I'm working with a heterogeneous dataset, and I was also surprised at the lack of heterogeneous distance metrics! The dataset I'm working with links to a paper that uses HEOM as its distance metric. I was about to implement it myself when I luckily stumbled upon your work!

While your metrics work with estimators such as NearestNeighbors, they don't work with sklearn's new imputation feature, KNNImputer.

I still consider myself new to machine learning, and this is my first time opening an issue on GitHub, but I implemented a small fix so the metric works with both the existing estimators and sklearn's new feature! (I only tested it with NearestNeighbors, but if it works there, it should work with the other estimators too.)

Issue:

If you are using a user-defined metric, KNNImputer needs a callable that accepts at least three inputs: instance one, instance two, and a missing_values keyword argument.

Taken directly from the KNNImputer documentation:

"callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value."

Below is what I did to make it work.

Let me know what you think!

Thanks again for making such a great tool!

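    import numpy as np
    from sklearn.impute import KNNImputer
    from distython import HEOM

    missing_values = [np.nan, 999]  # Something random here
    # X is the data array; HEOM takes it as its first argument
    heom_metric = HEOM(X, cat_ix=[0, 1], nan_equivalents=missing_values)
    # missing_values is accepted by the lambda but not passed to heom;
    # KNNImputer itself takes a single marker, not a list
    imputer = KNNImputer(
        missing_values=np.nan,
        metric=lambda x, y, missing_values=np.nan: heom_metric.heom(x, y),
    )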
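For anyone who wants to try the idea without installing distython, here is a minimal sketch of the same adapter pattern. `toy_metric` is a hypothetical stand-in for `heom_metric.heom` (it is not part of distython or sklearn), and `knn_impute_metric` is an illustrative name for the wrapper. The sketch assumes that KNNImputer passes `missing_values` to the callable as a keyword argument, as the quoted documentation describes, so a plain two-argument function would be called with a keyword it does not accept.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import NearestNeighbors


def toy_metric(x, y):
    """Hypothetical stand-in for heom_metric.heom: a NaN-tolerant,
    two-argument distance between two 1-D rows."""
    diff = np.abs(x - y)
    # Count any comparison involving a missing value as distance 1.
    diff[np.isnan(diff)] = 1.0
    return float(np.sqrt(np.sum(diff ** 2)))


def knn_impute_metric(x, y, missing_values=np.nan, **kwargs):
    """Adapter: accept the keyword KNNImputer supplies, ignore it,
    and delegate to the two-argument metric."""
    return toy_metric(x, y)


X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],  # second feature is missing and will be imputed
    [5.0, 6.0],
    [1.5, 2.2],
])

# KNNImputer calls the metric with a missing_values keyword.
imputer = KNNImputer(n_neighbors=1, metric=knn_impute_metric)
X_imputed = imputer.fit_transform(X)

# NearestNeighbors calls the metric with just two rows, so the
# unwrapped function can be used directly on the imputed data.
nn = NearestNeighbors(n_neighbors=2, metric=toy_metric).fit(X_imputed)
neighbor_dist, neighbor_idx = nn.kneighbors(X_imputed[:1])
```

Because the adapter gives `missing_values` a default, the same wrapper can also be handed to estimators that call the metric with only two arguments, which may be a simpler option than maintaining two separate callables.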

Small Fix To Make Metric Usable With Both SKLearn's NearestNeighbors & KNNImpute #9