KacperKubara / distython

Distance metrics which can handle mixed-type data and missing values

Small Fix To Make Metric Usable With Both SKLearn's NearestNeighbors & KNNImpute #9

Open darwin-a opened 4 years ago

darwin-a commented 4 years ago

Hi Kacper!

I'm working with a heterogeneous dataset and I was also surprised at the lack of heterogeneous distance metrics! The data I was working with came with a linked paper that used HEOM as its distance metric. I was about to implement it myself when I luckily stumbled upon your work!

While your metrics work with algorithms such as NearestNeighbors, they don't work with sklearn's new imputation feature, KNNImputer.

I still consider myself new to Machine Learning, and this is my first time opening an issue on GitHub, but I implemented a small fix so it works with both the previous algorithms (actually I only tested it on the NearestNeighbors implementation, but if it works there then it should work with other algorithms) and sklearn's new feature!

Issue:

If you are using a user-defined metric, KNNImputer needs a callable that takes at least three inputs (instance one, instance two, missing_values).

Taken directly from the KNNImputer documentation:

"callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value."
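To make the failure concrete, here is a minimal plain-Python sketch of the calling convention described in the quote. `call_like_knn_imputer` is a hypothetical stand-in written for illustration, not sklearn code. It only assumes that the metric gets called with a `missing_values` keyword, and the two toy metrics are made up as well.

```python
def call_like_knn_imputer(metric, x, y, missing_values=float("nan")):
    """Hypothetical stand-in: invoke a metric with a missing_values keyword."""
    return metric(x, y, missing_values=missing_values)


def two_arg_metric(x, y):
    # Toy metric with the plain (x, y) signature that NearestNeighbors accepts
    return sum(abs(a - b) for a, b in zip(x, y))


def three_arg_metric(x, y, missing_values=None):
    # Same distance, but the signature also accepts the extra keyword
    return two_arg_metric(x, y)


try:
    call_like_knn_imputer(two_arg_metric, [1, 2], [2, 4])
    two_arg_ok = True
except TypeError:
    # The unexpected missing_values keyword is rejected
    two_arg_ok = False

three_arg_result = call_like_knn_imputer(three_arg_metric, [1, 2], [2, 4])
```

Under this assumption the two-argument metric raises a `TypeError`, while the version that accepts `missing_values` returns a distance as usual.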

Below is what I did to make it work.

image

image

Let me know what you think!

Thanks again for making such a great tool!

KacperKubara commented 4 years ago

Hi! Sorry for the late reply, I've been quite busy with a few deadlines :) Thanks for opening the issue, and I am glad that you like Distython. Wow, I am super glad that Scikit-Learn actually added a new imputation method that is finally not a mean/mode imputation!

Sounds like a good feature to add. I am currently out of time due to my undergrad dissertation, so I won't be able to make a PR to the repo, but maybe you would like to create one? If you struggle with the PR, let me know and I would be glad to help :)

I think that atm the problem (without modifying the package) can be circumvented as follows:

  1. Initialize the HEOM class with nan_equivalents that cover the missing_values you pass to sklearn's KNNImputer
  2. Use a lambda to choose which parameters get pushed to heom(). I think you could try something like this and see if it works:
    import numpy as np
    from sklearn.impute import KNNImputer
    from distython import HEOM

    nan_eqvs = [np.nan, 999]  # Something random here
    heom_metric = HEOM(X, cat_ix=[0, 1], nan_equivalents=nan_eqvs)  # X: your data array
    # KNNImputer passes missing_values as a keyword; **kwargs absorbs it so it never reaches heom()
    imputer = KNNImputer(missing_values=np.nan, metric=lambda x, y, **kwargs: heom_metric.heom(x, y))

    I haven't checked whether the code above works; it's just to give you a rough idea. I have used this package with lambdas like the one above and it worked for me :) By wrapping the heom() function in a lambda, we can choose which parameters we pass to the function and which we just leave unused.
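If the lambda gets hard to read, the same idea can be written as a small reusable adapter. This is only a sketch in plain Python: `drop_extra_kwargs` and `overlap_distance` are hypothetical names made up for the example, and `overlap_distance` is a toy stand-in for `heom()`, not the real metric.

```python
from functools import wraps


def drop_extra_kwargs(metric):
    """Wrap a two-argument metric so that extra keyword arguments are ignored."""
    @wraps(metric)
    def wrapped(x, y, **kwargs):
        # kwargs (e.g. missing_values) are deliberately discarded
        return metric(x, y)
    return wrapped


def overlap_distance(x, y):
    # Toy stand-in for heom(): counts the positions where the values differ
    return sum(1 for a, b in zip(x, y) if a != b)


adapted = drop_extra_kwargs(overlap_distance)
result = adapted(["a", 1, 2], ["a", 3, 2], missing_values=float("nan"))
```

The adapted function keeps the original metric's name thanks to `functools.wraps`, and in principle something like `drop_extra_kwargs(heom_metric.heom)` could be passed as the `metric` argument instead of the lambda, though I haven't tested that against the imputer.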

Best, Kacper