KacperKubara / distython

Distance metrics which can handle mixed-type data and missing values

The first parameter passed from the Knn.fit to the Metric does not belong to the DataFrame #16

Open 08volt opened 3 years ago

08volt commented 3 years ago

Hi, I have a problem with Knn.fit when I use the HVDM metric. To see the data I'm working with, I added two print statements at the beginning of the function:

[Screenshot 2020-10-23 at 17:03:39]

I don't understand why, when I call Knn.fit, the "x" parameter is so different from the rows of the DataFrame:

[Screenshot 2020-10-23 at 17:05:36]

Can anyone help me understand what the KNN is doing?

Thank you

KacperKubara commented 3 years ago

Hi, HVDM seems to have a bug and doesn't work correctly. Please use HEOM instead.
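For readers unfamiliar with HEOM, here is a minimal pure-Python sketch of the textbook Heterogeneous Euclidean-Overlap Metric. It is not distython's implementation: the function name `heom_distance`, the `ranges` argument, and the choice to return the square root of the summed squares are all assumptions made for illustration, and the library's return value may differ. A missing value (marked by a placeholder such as 12345) counts as the maximum per-attribute distance of 1.

```python
import math

NAN_EQV = 12345  # placeholder marking a missing value (hypothetical default)


def heom_distance(x, y, categorical_ix, ranges, nan_eqv=NAN_EQV):
    """Textbook HEOM between two rows; an illustrative sketch only.

    x, y           -- sequences of attribute values of equal length
    categorical_ix -- indices of categorical attributes
    ranges         -- per-attribute (max - min) for numeric attributes
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if a == nan_eqv or b == nan_eqv:
            d = 1.0  # missing value: maximum distance
        elif i in categorical_ix:
            d = 0.0 if a == b else 1.0  # overlap metric
        elif ranges[i]:
            d = abs(a - b) / ranges[i]  # range-normalised difference
        else:
            d = 0.0  # constant column: avoid dividing by a zero range
        total += d * d
    return math.sqrt(total)
```

For example, `heom_distance([1.0, 0, 5.0], [3.0, 1, 12345], {1}, [4.0, 1.0, 10.0])` combines a numeric term of 0.5, a categorical mismatch of 1, and a missing-value term of 1.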

varshakhandekar commented 3 years ago

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import load_boston

# Importing a custom metric class
from distython import HEOM
from distython import HVDM
from distython import VDM

# Load the dataset from sklearn
boston = load_boston()
print(type(boston))
print(boston.data.shape)
boston_data = boston["data"]

# Categorical variables in the data
categorical_ix = [3, 8]
y_ix = [12]

# The problem here is that NearestNeighbors can't handle np.nan
# So we have to set up the NaN equivalent
nan_eqv = 12345

# Introduce some missingness to the data for the purpose of the example
row_cnt, col_cnt = boston_data.shape
for i in range(row_cnt):
    for j in range(col_cnt):
        rand_val = np.random.randint(20, size=1)
        if rand_val == 10:
            boston_data[i, j] = nan_eqv

# Declare the HEOM with a correct NaN equivalent value
heom_metric = HEOM(boston_data, categorical_ix, nan_equivalents = [nan_eqv])
hvdm_metric = HVDM(boston_data, y_ix, categorical_ix, nan_equivalents = [nan_eqv])

# Declare NearestNeighbor and link the metric
neighbor = NearestNeighbors(metric = heom_metric.heom)
neighbor1 = NearestNeighbors(metric = hvdm_metric.hvdm)

# Fit the model which uses the custom distance metric
neighbor.fit(boston_data)
neighbor1.fit(boston_data)

# Return 5-Nearest Neighbors to the 1st instance (row 1)
result = neighbor.kneighbors(boston_data[0].reshape(1, -1), n_neighbors = 5)
result1 = neighbor1.kneighbors(boston_data[0].reshape(1, -1), n_neighbors = 5)

print(result)
print(result1)

Error:

Division by zero is not allowed!

UnboundLocalError                         Traceback (most recent call last)
in
      1 # Fit the model which uses the custom distance metric
      2 neighbor.fit(boston_data)
----> 3 neighbor1.fit(boston_data)
      4 # Return 5-Nearest Neighbors to the 1st instance (row 1)
      5 result = neighbor.kneighbors(boston_data[0].reshape(1, -1), n_neighbors = 5)
varshakhandekar commented 3 years ago

There is a division-by-zero error in HVDM.
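One plausible reading of that output (an assumption, not confirmed against the distython source) is that a variable is assigned only inside a `try` block, the `except ZeroDivisionError` branch just prints a message, and the code then uses the variable anyway. That pattern produces exactly this pairing of a printed message followed by `UnboundLocalError`. The sketch below reproduces it in plain Python; `fragile_term` and `safe_term` are hypothetical names, and the 0.0 fallback is only one possible design choice.

```python
def fragile_term(n_xc, n_x, n_yc, n_y):
    """Hypothetical ratio-difference term with the suspected bug pattern."""
    try:
        result = abs(n_xc / n_x - n_yc / n_y)
    except ZeroDivisionError:
        print("Division by zero is not allowed!")
    # If the except branch ran, 'result' was never assigned,
    # so this line raises UnboundLocalError.
    return result


def safe_term(n_xc, n_x, n_yc, n_y, fallback=0.0):
    """Same term, but returns an explicit fallback when a count is zero."""
    if n_x == 0 or n_y == 0:
        return fallback  # assumption: treat an undefined ratio as no difference
    return abs(n_xc / n_x - n_yc / n_y)
```

Calling `fragile_term(1, 0, 1, 2)` prints the message and then raises `UnboundLocalError`, while `safe_term(1, 0, 1, 2)` returns the fallback. Raising a descriptive exception instead of returning a fallback would be an equally reasonable choice.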