feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

Request: selecting `variables` through a user-supplied function #589

Open david-cortes opened 1 year ago

david-cortes commented 1 year ago

Transformers in this library take a `variables` argument, which is expected to be a list of column names.

Often, variables follow some natural grouping, and one wants to apply a given transformer to all variables whose names match a pattern. With a single modeling pipeline this is relatively easy: store the column names in Python variables. But often one wants to try the same transformer pipeline with different groups of features, or with slight variations of earlier transformations, so the exact list of variables changes from one run to another and the transformers have to be redefined each time.

It would be helpful if the transformers could also accept `variables` as a function that is applied to each column name and returns True or False to indicate whether the transformer applies to that variable.
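To illustrate the request, here is a minimal sketch of how a `variables` argument that is either a list or a function could be resolved into column names. The helper `resolve_variables` and the column names are hypothetical and not part of feature_engine.

```python
from typing import Callable, List, Sequence, Union

# Either an explicit list of column names or a predicate on a column name
Variables = Union[Sequence[str], Callable[[str], bool]]


def resolve_variables(variables: Variables, columns: Sequence[str]) -> List[str]:
    """Hypothetical helper: turn `variables` into a concrete list of names."""
    if callable(variables):
        # Keep the columns for which the predicate returns True
        return [name for name in columns if variables(name)]
    # An explicit list is used as given
    return list(variables)


columns = ["price_usd", "price_eur", "quantity"]
by_pattern = resolve_variables(lambda name: name.startswith("price_"), columns)
by_list = resolve_variables(["quantity"], columns)
```

With something like this, the same transformer definition could be reused across runs in which the set of matching columns differs.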

solegalli commented 1 year ago

Hey @david-cortes

This sounds like a very specific case. I am not sure how widespread its use would be.

Would you be able to provide an example? I can't really picture the scenario.

Thank you

david-cortes commented 1 year ago

A quick example for now: suppose I have a data frame with numeric features that have missing values, and I want to process it as follows:

In this case, the binary missing-indicator columns should not be squared, since squaring a 0/1 column leaves it unchanged. One way to achieve this would be to have the first transformer name the indicator columns with a given suffix, and then let the last transformer select the columns without that suffix.

You might then say that one can simply pass the column names directly to the last transformer, but suppose that I want to try two different models using different subsets of the features, or that I want to apply them to two datasets with similar contents (e.g. data from 1-30 days ago and data from 31-60 days ago, which might have similar but not identical column names).
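A rough sketch of the behaviour being asked for, under stated assumptions: `SquareSelected` is a hypothetical transformer (not part of feature_engine), and the `_na` suffix and column names are made up. The selector is resolved against the column names at fit time, so one definition works on both datasets.

```python
import pandas as pd


class SquareSelected:
    """Hypothetical transformer: squares the columns chosen by `selector`."""

    def __init__(self, selector):
        self.selector = selector  # callable: column name -> bool

    def fit(self, X, y=None):
        # Resolve the callable against the columns actually present in X
        self.variables_ = [c for c in X.columns if self.selector(c)]
        return self

    def transform(self, X):
        X = X.copy()  # the result stays a DataFrame
        X[self.variables_] = X[self.variables_] ** 2
        return X


def not_indicator(name):
    # Assumed convention: missing indicators carry the "_na" suffix
    return not name.endswith("_na")


recent = pd.DataFrame({"spend_30d": [1.0, 2.0], "spend_30d_na": [0, 1]})
older = pd.DataFrame({"spend_60d": [3.0, 4.0], "spend_60d_na": [1, 0]})

out_recent = SquareSelected(not_indicator).fit(recent).transform(recent)
out_older = SquareSelected(not_indicator).fit(older).transform(older)
```

Only the non-indicator columns are squared in each frame, even though their names differ between the two datasets.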

kylegilde commented 1 year ago

ColumnTransformer and make_column_selector support using callables to select columns.

david-cortes commented 1 year ago

> ColumnTransformer and make_column_selector support using callables to select columns.

But those transformers from scikit-learn oftentimes force conversions between DataFrames and matrices, which is undesirable for the kind of transformations that feature_engine does.

ClaudioSalvatoreArcidiacono commented 1 year ago

If you want ColumnTransformer to return a DataFrame, you can use the set_output method.

For example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = pd.DataFrame({
    "documents": ["First item", "second one here", "Is this the last?"],
    "width": [3, 4, 5],
})
# pandas output does not support sparse results, so the encoder
# for the "documents" column is configured to return a dense array
ct = ColumnTransformer(
    [("text_preprocess", OneHotEncoder(sparse_output=False), ["documents"]),
     ("num_preprocess", MinMaxScaler(), ["width"])],
    # Do not prefix the output feature names with the transformer names
    verbose_feature_names_out=False,
)
# Ensures that a DataFrame is returned by transform
# (set_output takes keyword arguments only; requires scikit-learn >= 1.2)
ct.set_output(transform="pandas")

X_trans = ct.fit_transform(X)