Preprocessing on specific features

EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

http://epistasislab.github.io/tpot/

GNU Lesser General Public License v3.0

Preprocessing on specific features #1110

Open hanshupe opened 3 years ago

hanshupe commented 3 years ago

I see that after the TPOT optimization a preprocessor such as RobustScaler was selected. Is it possible to apply RobustScaler only to the few features where it makes sense, rather than to the entire feature set? One feature may have outliers and require RobustScaler(0.1, 0.9), while the other features do not. Can TPOT take this into account?

weixuanfu commented 3 years ago

I think the current TPOT may not support this use case. Supporting it would require adding scikit-learn's ColumnTransformer to TPOT.

edwardyu commented 3 years ago

I'm interested in working on this. ColumnTransformer takes a list of columns. How would we make this list available to the genetic algorithm?
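
For reference, here is a minimal sketch of what per-column scaling looks like in plain scikit-learn, outside TPOT. The toy data and the name `outlier_columns` are made up for illustration. Note that scikit-learn's `quantile_range` is expressed in percentiles, so a 10th-to-90th range is written `(10.0, 90.0)`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# Toy data (illustrative only): column 0 has an outlier, columns 1 and 2 do not.
X = np.array([
    [1.0, 10.0, 0.1],
    [2.0, 20.0, 0.2],
    [3.0, 30.0, 0.3],
    [4.0, 40.0, 0.4],
    [1000.0, 50.0, 0.5],
])

# Hypothetical choice of columns that need robust scaling.
outlier_columns = [0]

preprocessor = ColumnTransformer(
    transformers=[
        # (name, transformer, columns): scale only the listed columns.
        ("robust", RobustScaler(quantile_range=(10.0, 90.0)), outlier_columns),
    ],
    # Leave every other column untouched.
    remainder="passthrough",
)

# Transformed columns come first in the output, followed by the passthrough columns.
X_scaled = preprocessor.fit_transform(X)
```

The open question for TPOT is how `outlier_columns` would be chosen by the genetic algorithm instead of being fixed by hand.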

edwardyu commented 3 years ago

OK I have an idea. In TPOTBase.fit(), add a hook called _dynamically_modify_config_dict(), which will modify config_dict with a dict like this:

    'tpot.builtins.ColumnTransformer': {
        'transformers': ['sklearn.preprocessing.StandardScaler', 'sklearn.preprocessing.RobustScaler', ...],
        'include_col_1': [True, False],
        'include_col_2': [True, False],
        ...
        'include_col_n': [True, False],
    },
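
A rough sketch of how such an entry could be generated from the number of columns in the training data. This is only an illustration of the idea: `build_column_transformer_config` and `selected_columns` are hypothetical helpers, and `tpot.builtins.ColumnTransformer` is the proposed operator name, not something that exists in TPOT today.

```python
def build_column_transformer_config(n_columns, transformer_names):
    """Build the proposed config entry with one boolean switch per column.

    Hypothetical helper: the key 'tpot.builtins.ColumnTransformer' is the
    operator name proposed in this thread, not an existing TPOT builtin.
    """
    entry = {"transformers": list(transformer_names)}
    for i in range(1, n_columns + 1):
        # Each column gets its own hyperparameter the search can toggle.
        entry[f"include_col_{i}"] = [True, False]
    return {"tpot.builtins.ColumnTransformer": entry}


def selected_columns(params, n_columns):
    """Decode one sampled set of include_col_* flags into 0-based column indices."""
    return [i - 1 for i in range(1, n_columns + 1) if params.get(f"include_col_{i}")]


config_update = build_column_transformer_config(
    3,
    ["sklearn.preprocessing.StandardScaler", "sklearn.preprocessing.RobustScaler"],
)

# Example of one combination the search might sample (made-up values).
sampled = {"include_col_1": True, "include_col_2": False, "include_col_3": True}
columns = selected_columns(sampled, 3)
```

One trade-off of this encoding is that the number of hyperparameters grows with the number of columns, so the search space could become large for wide datasets.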

What do you think @weixuanfu ?