feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

SmartCorrelatedSelection does not behave consistently on different hardware with highly correlated features #824

Open ClaudioSalvatoreArcidiacono opened 1 day ago

ClaudioSalvatoreArcidiacono commented 1 day ago

Describe the bug The behavior of SmartCorrelatedSelection is unpredictable when there are features that are very similar: so similar that they have the same number of missing values or the same single-feature model performance.

In particular, I have noticed that running the exact same code, with the same Python version and package versions, on my development machine (a Mac M1) and on a different machine (a Linux-based remote node) gives different results.

To Reproduce Run the code below on two different machines:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from feature_engine.selection import SmartCorrelatedSelection
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, random_state=42)

# Convert to DataFrame for feature naming
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

# Duplicate features
for i in range(10):  # Duplicating the first 10 features
    X[f'feature_{i}_duplicate'] = X[f'feature_{i}']

# Initialize the Logistic Regression model
model = LogisticRegression()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Initialize the SmartCorrelatedSelection transformer
selector = SmartCorrelatedSelection(
    estimator=model,
    selection_method='model_performance',
    threshold=0.9,  # Example threshold for correlation
    cv=cv
)

# Fit the selector to the data
selector.fit(X, y)

# Get the features that have been dropped
dropped_features = selector.features_to_drop_
print(dropped_features)

On one machine I get:

['feature_0', 'feature_1_duplicate', 'feature_2_duplicate', 'feature_3', 'feature_4', 'feature_5', 'feature_6_duplicate', 'feature_7', 'feature_8', 'feature_9']

whereas on another machine I get:

['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7_duplicate', 'feature_8', 'feature_9']

Notice how the dropped features differ between the two runs.

For the tests above I used the following package versions:

Expected behavior I would expect the two runs to give exactly the same results.

Screenshots N/A

Additional context I have already implemented an alternative version of SmartCorrelatedSelection that does not have this issue, and I would like to contribute it to the project.

There are two reasons for the issue above:

  1. When feature values are sorted (for example here), the default sorting algorithm of pandas Series is used, which is quicksort. Quicksort is not stable (see the pandas docs), meaning that in case of ties it does not keep the original feature order. To solve this I added the parameter kind="mergesort" to every call to pd.Series.sort_values; mergesort is a stable sorting algorithm, so it guarantees the same ordering even in case of ties (see the first sketch after this list).
  2. Specifically for selection_method='model_performance', temp_set is defined here as a set, a collection that does not preserve insertion order. When this value is returned and used here, the original feature order is no longer preserved, so when the features are finally sorted here, the result can differ in case of ties even with a stable algorithm like mergesort (because the original order was already lost to the set). To solve this I changed the temp_set variable to a list instead of a set() (see the second sketch after this list).
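
A minimal sketch of the first point (plain pandas, not feature-engine's internal code): with tied values, kind="mergesort" guarantees that ties keep their original index order, while the default quicksort makes no such guarantee.

import pandas as pd

# Three perfectly tied scores; a stable sort must keep the index order
# feature_1, feature_1_duplicate, feature_2.
scores = pd.Series(
    [0.5, 0.5, 0.5],
    index=["feature_1", "feature_1_duplicate", "feature_2"],
)

# Stable: tied entries keep their original order on every platform.
print(scores.sort_values(kind="mergesort").index.tolist())
# ['feature_1', 'feature_1_duplicate', 'feature_2']

# Default quicksort: no stability guarantee for ties.
print(scores.sort_values().index.tolist())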
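
And a sketch of the second point: iterating over a Python set of strings is not reproducible across interpreter runs, because string hashing is randomized per process (PYTHONHASHSEED), whereas a list preserves insertion order.

# Run this script twice: the set's iteration order can differ between
# runs (and machines), while the list's order is fixed.
features = [f"feature_{i}" for i in range(10)]

temp_set = set(features)    # iteration order depends on string hashes
temp_list = list(features)  # iteration order == insertion order

print(list(temp_set))  # may change from run to run
print(temp_list)       # always feature_0 through feature_9, in order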
ClaudioSalvatoreArcidiacono commented 1 day ago

I have submitted a PR with my suggestions; let me know if you would like me to make changes :)

solegalli commented 1 day ago

Hi @ClaudioSalvatoreArcidiacono

Thanks for the issue.

If you set the random_state in the LogisticRegression, do you still see this behaviour?
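
For reference, that would be a one-line change to the repro above (a hypothetical tweak; the seed value is arbitrary):

model = LogisticRegression(random_state=42)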