EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Using TPOT's GP framework to select features #1078

Open sokol11 opened 4 years ago

sokol11 commented 4 years ago

Hi. I wonder if it is possible to use TPOT's genetic programming framework to select the best subset of features. I already have an idea of which classification algorithm and parameters work well, and I do not want to optimize the entire ML pipeline. I would just like to create a population of smaller feature sets out of the ~1000 features I have and use crossover/mutation to find the best set, while using my static estimator for model evaluation. Can I do that with TPOT? If so, how might I go about doing it? Thanks!

weixuanfu commented 4 years ago

TPOT has a built-in FeatureSetSelector (see this link) to help users select the best feature set based on prior expert knowledge. You can also limit the ML hyperparameter space in config_dict (for example, by using a single static estimator in the dictionary) and fix the pipeline shape with the template parameter.
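A minimal sketch of the config_dict approach weixuanfu describes: restricting the dictionary to one estimator, with single-value hyperparameter lists so the estimator is effectively fixed. The RandomForest settings here are illustrative placeholders, not recommendations.

```python
# A config_dict restricted to a single, effectively static estimator.
# Single-element lists leave TPOT nothing to tune for that parameter.
static_config = {
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100],
        "max_depth": [8],
        "min_samples_leaf": [2],
    },
}

# With TPOT installed, the pipeline shape can then be pinned via the
# `template` parameter, e.g.:
#
#   from tpot import TPOTClassifier
#   tpot = TPOTClassifier(config_dict=static_config,
#                         template="Classifier",
#                         generations=5, population_size=20)
#   tpot.fit(X_train, y_train)
```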

sokol11 commented 4 years ago

Thank you @weixuanfu. Yes, I saw FeatureSetSelector, but as I understand it, it is used to specify a static set of features rather than to optimize the feature set composition. Am I correct in understanding that TPOT does not select features via genetic optimization, i.e., binary-encoding all the features and using genetic programming to find the best feature subset? Thanks.

weixuanfu commented 4 years ago

Your understanding is correct. TPOT cannot select features using a GA without a static set of feature groups based on prior expert knowledge. It may be a good enhancement to add to TPOT.
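Outside of TPOT, the scheme sokol11 describes (binary-encoding features and evolving subsets against a fixed estimator) is straightforward to sketch by hand. Below is a stdlib-only toy version: the fitness function is a stand-in that rewards a hypothetical set of "informative" features and penalizes subset size; in practice it would be a cross-validated score of your static estimator on the selected columns.

```python
import random

random.seed(0)

N_FEATURES = 20
# Hypothetical "true" informative features for this toy demo only.
INFORMATIVE = set(range(5))

def fitness(mask):
    """Score a 0/1 feature mask: reward informative picks, penalize size."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & INFORMATIVE) - 0.1 * len(chosen)

def mutate(mask, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ (random.random() < rate) for bit in mask]

def crossover(a, b):
    """Single-point crossover of two masks."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def evolve(pop_size=30, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best_mask = evolve()
```

Replacing `fitness` with a `cross_val_score` call on the masked columns turns this into the feature-subset GA described above, at the cost of one model evaluation per individual per generation.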

sokol11 commented 4 years ago

Understood. Thanks!

weixuanfu commented 4 years ago

Alternatively, you may use a template like Selector-Classifier to select features indirectly, by letting GP optimize the best Selector step.
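A sketch of that Selector-Classifier idea: a config_dict pairing a tunable selector with an effectively fixed classifier, so GP only searches over the selection step. The percentile grid and classifier settings are illustrative assumptions.

```python
# Selector is tunable (GP searches the percentile), classifier is fixed.
selector_config = {
    "sklearn.feature_selection.SelectPercentile": {
        "percentile": list(range(5, 100, 5)),
        "score_func": {"sklearn.feature_selection.f_classif": None},
    },
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100],
    },
}

# With TPOT installed:
#
#   from tpot import TPOTClassifier
#   tpot = TPOTClassifier(config_dict=selector_config,
#                         template="Selector-Classifier")
#   tpot.fit(X_train, y_train)
```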