feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

SmartCorrelatedSelection when fitted returns all features in .variables_ instead of the selected features as in the documentation #730

Closed jnsofini closed 3 months ago

jnsofini commented 3 months ago

Describe the bug

I have a fitted transformer. According to the documentation, I expected the .variables_ attribute to contain only the selected features. Instead, it returns all the features.

To Reproduce

Steps to reproduce the behavior:

    import pandas as pd
    from sklearn.datasets import make_classification
    from feature_engine.selection import SmartCorrelatedSelection

    # make dataframe with some correlated variables
    def make_data():
        X, y = make_classification(n_samples=1000,
                                   n_features=12,
                                   n_redundant=4,
                                   n_clusters_per_class=1,
                                   weights=[0.50],
                                   class_sep=2,
                                   random_state=1)

        # transform arrays into pandas df and series
        colnames = ['var_'+str(i) for i in range(12)]
        X = pd.DataFrame(X, columns=colnames)
        return X

    X1 = make_data()

    # set up the selector
    tr2 = SmartCorrelatedSelection(
        variables=None,
        method="pearson",
        threshold=0.8,
        missing_values="raise",
        selection_method="variance",
        estimator=None,
    )

    Xt = tr2.fit_transform(X1)
    tr2.features_to_drop_
    # ['var_0', 'var_4', 'var_6', 'var_9']

Expected behavior

tr2.variables_ should give ['var_1', 'var_10', 'var_11', 'var_2', 'var_3', 'var_5', 'var_7', 'var_8']

Instead I get ['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7', 'var_8', 'var_9', 'var_10', 'var_11']

solegalli commented 3 months ago

Hi @jnsofini

The attribute variables_ shows the variables that were evaluated during the selection process, not the ones that were kept. If variables=None when you set up the transformer, then variables_ will be all numerical variables seen during fit(). If variables=[var1, var2, var3], then variables_ will also be [var1, var2, var3].

If you want to obtain the variables that were selected, you can use get_support() in combination with feature_names_in_, or call get_feature_names_out(). These work the same way as their counterparts in sklearn.
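How a support mask combines with the input feature names can be sketched without the library. The snippet below is a plain-Python illustration under stated assumptions, not feature_engine's implementation. It takes the column names and the dropped features reported in this issue, builds the kind of boolean mask that sklearn-style selectors expose, and applies it to the input names. On the fitted transformer, the inputs would presumably come from tr2.feature_names_in_ and tr2.features_to_drop_.

```python
# Values copied from the report above; on a fitted selector they would come
# from tr2.feature_names_in_ and tr2.features_to_drop_ (assumption).
feature_names_in = ['var_' + str(i) for i in range(12)]
features_to_drop = ['var_0', 'var_4', 'var_6', 'var_9']

# Boolean mask: True for every feature that is kept, False for dropped ones.
drop = set(features_to_drop)
support = [name not in drop for name in feature_names_in]

# Apply the mask to the input names to recover the selected features.
selected = [name for name, keep in zip(feature_names_in, support) if keep]
print(selected)
```

The selected names come out in their original column order, so they match the list expected in the report as a set but not in its sorted order.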