Marijkevandesteene / MachineLearning

repo to share progress and to manage versions of exam MachineLearning (M14)

KNNImputer for the missing values #25

Closed binomaiheu closed 3 months ago

binomaiheu commented 3 months ago

So I changed the KNNImputer in the following way:

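# -- imports needed for the scaling and imputation steps
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# -- first define the min-max scaler and apply it to the original data
imputer_scaler = MinMaxScaler().set_output(transform="pandas")
train_V2_scaled = imputer_scaler.fit_transform(train_V2)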

# -- next define the imputer having 5 neighbours (default) and uniform weights
imputer_knn = KNNImputer(n_neighbors=5, weights='uniform').set_output(transform="pandas")

# -- apply to the scaled data
train_V2_scaled = imputer_knn.fit_transform(X=train_V2_scaled)

# -- and apply the inverse transform
train_V2 = imputer_scaler.inverse_transform(train_V2_scaled)

# -- interestingly, set_output(transform="pandas") is not implemented yet for inverse_transform in sklearn,
#    so we wrap the returned numpy array in a dataframe ourselves (see: https://github.com/scikit-learn/scikit-learn/issues/27843)
train_V2 = pd.DataFrame(train_V2, columns=train_V2_scaled.columns, index=train_V2_scaled.index)

with some additional explanation:

Instead of treating numerical and categorical features separately, we use a KNNImputer so that the imputation can exploit possible correlations between the features. However, the KNN technique uses a distance-based metric and is therefore sensitive to the scale of the features, so we have to rescale the features before we can use the KNNImputer. Most of our features are categorical and already take values between 0 and 1, so we simply use a MinMaxScaler to rescale everything to that fixed range.
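To illustrate the scale sensitivity, here is a minimal sketch on a made-up three-row table (the column names `big`, `small` and `val` are hypothetical and not part of the exam data). With a single neighbour, the unscaled imputer picks its donor almost entirely on the wide-range column, while the scaled version lets the 0/1 column count as well, so the two runs fill the same gap with different values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: "big" spans 0..1000, "small" spans 0..1,
# and the last row is missing "val".
toy = pd.DataFrame({
    "big": [0.0, 1000.0, 100.0],
    "small": [0.0, 1.0, 1.0],
    "val": [1.0, 5.0, np.nan],
})

# Without scaling, the distance is dominated by "big", so row 0 is the donor.
raw_imputed = KNNImputer(n_neighbors=1).fit_transform(toy)

# With min-max scaling, both features contribute on the same 0..1 range,
# and row 1 (same "small" value) becomes the nearest donor.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)  # NaNs are ignored in fit and kept in transform
scaled_imputed = scaler.inverse_transform(
    KNNImputer(n_neighbors=1).fit_transform(scaled)
)

raw_val = raw_imputed[2, 2]        # value borrowed from row 0
scaled_val = scaled_imputed[2, 2]  # value borrowed from row 1
print(raw_val, scaled_val)
```

The gap is filled from a different donor row once the features share a common range, which is the reason for scaling before imputing.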

Marijkevandesteene commented 3 months ago

A small problem with imputer_scaler: the idea was to use the KNN imputer for the score set as well (and therefore to apply the scaling you added to the score set too). But here the full train_V2 (including the outcome columns) is used, and we do not have those columns in the score set. This produces errors, so we still need to come up with a solution.

We could also use the scaler and imputer on the input data only, X_train_V2 = train_V2.drop(columns=['outcome_profit','outcome_damage_inc','outcome_damage_amount'], inplace=False), but I will have to look into how to get the result back into train_V2. For now it is commented out.

Marijkevandesteene commented 3 months ago

Looked at this together and worked with the input data only, both for the scaler and for the imputer.
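For reference, the input-only approach could look roughly like the sketch below. This is not the code from the notebook: the helper names `fit_input_imputer` and `impute_inputs`, the toy columns `feat_a` and `feat_b`, and the tiny data frames are all assumptions made for illustration. Only the three outcome column names come from the thread. The scaler and imputer are fitted on the input columns of the training data, the imputed values are written back next to the untouched outcome columns, and the same fitted objects are reused on a score set that has no outcome columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

OUTCOME_COLS = ["outcome_profit", "outcome_damage_inc", "outcome_damage_amount"]

def fit_input_imputer(train, outcome_cols=OUTCOME_COLS, n_neighbors=5):
    """Fit scaler and imputer on the input columns only (hypothetical helper)."""
    input_cols = [c for c in train.columns if c not in outcome_cols]
    scaler = MinMaxScaler().fit(train[input_cols])
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="uniform")
    imputer.fit(scaler.transform(train[input_cols]))
    return input_cols, scaler, imputer

def impute_inputs(df, input_cols, scaler, imputer):
    """Scale, impute and unscale the input columns; other columns stay as they are."""
    scaled = imputer.transform(scaler.transform(df[input_cols]))
    restored = pd.DataFrame(
        scaler.inverse_transform(scaled), columns=input_cols, index=df.index
    )
    out = df.copy()
    out[input_cols] = restored
    return out

# Made-up stand-ins for train_V2 and the score set.
train = pd.DataFrame({
    "feat_a": [0.0, 1.0, 2.0, np.nan],
    "feat_b": [0.0, 0.5, 1.0, 1.0],
    "outcome_profit": [10.0, 20.0, 30.0, 40.0],
    "outcome_damage_inc": [0.0, 1.0, 0.0, 1.0],
    "outcome_damage_amount": [0.0, 5.0, 0.0, 7.0],
})
score = pd.DataFrame({"feat_a": [np.nan], "feat_b": [0.0]})

input_cols, scaler, imputer = fit_input_imputer(train)
train_imputed = impute_inputs(train, input_cols, scaler, imputer)
score_imputed = impute_inputs(score, input_cols, scaler, imputer)
```

Because the score set is only transformed and never fitted, it is imputed using the neighbours and the scaling range learned from the training inputs. One caveat for this sketch: by default KNNImputer drops columns that are entirely missing during fit, which would break the write-back step, so it assumes every input column has at least one observed value.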