EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

How to get the best feature set using tpot? #988

Open Selva-sys opened 4 years ago

Selva-sys commented 4 years ago

Hello,

I recently started using tpot for data analysis and am currently in a situation where I have to select the best features (analogous to selecting the best parameters) that lead to a high performance metric. In my case, it's the F1-score.

Context of the question

I ran the TPOT classifier on my dataset to select the best-performing model/pipeline.

Now what I am looking for is the best feature subset. I have already referred to issue #710, and it doesn't help, because I am looking for the subset of all my features that leads to high accuracy.

I was assuming that the best feature subset is the direct result of the optimization (the purpose of genetic programming) and that feature importance is a derived result. Am I right?

I did see that tpot has FeatureSetSelector, but why are we expected to key in the feature subset indices as shown below? Can anyone help me understand what this code does? Based on my understanding, we provide a CSV file that contains the feature subset sizes and feature names. Out of the 4 feature subsets present in the CSV, we select one subset. Am I right?

```python
classifier_config_dict['tpot.builtins.FeatureSetSelector'] = {
    'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],
    'sel_subset': [0, 1]  # select only one feature set, a list of indices of subsets in the list above
    # 'sel_subset': list(combinations(range(3), 2))  # select two feature sets
}
```

But may I know why we have to provide this? How can we do this with a genetic-algorithm-style optimization approach that finds the best feature set by trying multiple combinations?

Can someone help me with this through a simple example like the Boston or Iris dataset, please?

weixuanfu commented 4 years ago

FeatureSetSelector is a special new operator in TPOT. This operator enables feature selection based on prior expert knowledge, so those feature sets are not obtained via genetic programming in TPOT. We added this operator for analyzing big genomic data with prior expert knowledge of biological pathways. Please check this link for more details.
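For context, here is a minimal sketch of wiring that config into a TPOT run, assuming the `config_dict`/`template` parameters and the `FeatureSetSelector-Transformer-Classifier` template shown in the TPOT documentation; the `sel_subset` indices are placeholders, not a recommendation:

```python
# A minimal sketch, assuming the config_dict/template API from the TPOT docs.
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict

classifier_config_dict['tpot.builtins.FeatureSetSelector'] = {
    'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],
    'sel_subset': [0, 1],  # placeholder: candidate subset indices GP may pick from
}

# Force every candidate pipeline to begin with the expert-knowledge selector.
tpot = TPOTClassifier(config_dict=classifier_config_dict,
                      template='FeatureSetSelector-Transformer-Classifier',
                      generations=5, population_size=20, random_state=42)
```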

Selva-sys commented 4 years ago

Sure, got it. Thanks for the prompt response. Am I then right to understand that tpot currently doesn't support best-feature-subset selection, i.e., getting the best subset of features based on the optimization?

weixuanfu commented 4 years ago

I think there is one hacky way to do feature selection based on the optimization. You could try TPOT with template="SelectFromModel-Classifier" or template="Selector-Classifier" and then check which features are selected in the 1st step of the fitted_pipeline_ attribute of the fitted TPOTClassifier object.
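To make that concrete on the Iris dataset the original question mentioned, here is a minimal sketch using only the public TPOT API (the small generations/population_size values are just to keep the run short, and default accuracy scoring is used since Iris is multiclass):

```python
# A minimal sketch of the suggested workaround on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20,
                      template='Selector-Classifier',
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

# The first pipeline step is the selector that GP settled on.
selector = tpot.fitted_pipeline_.steps[0][1]
# scikit-learn selectors (RFE, SelectPercentile, VarianceThreshold,
# SelectFromModel, ...) expose the chosen features via get_support().
print(selector.get_support(indices=True))
```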

SSMK-wq commented 4 years ago

Hi,

Through the fitted_pipeline_ attribute, I can only see the pipeline and its attributes. Am I right?

Later, I have to execute the pipeline and use get_support to get the list of features. Am I right?

1) So in this case, it picked RFE (refer below) as the feature selection approach after trying several other feature selection approaches?

2) Later, once RFE was chosen, did it try combinations of different parameters and finally arrive at these parameters? (refer below)

3) I see that evaluated_individuals_ only has RFE, SelectPercentile, and VarianceThreshold, which keep repeating. Will tpot not try other methods like LASSO?

4) So am I right to understand that tpot tried multiple values for max_features and chose the best n features? It wasn't using the default value for max_features. Am I right?

5) I might be wrong here, but is it possible for tpot to miss some combinations of hyperparameters? For example, it produces an F1-score of 79%, but in my previous manual experiments I changed a few values and was able to get an F1-score of 81%. I am just trying to understand why tpot wasn't able to find the best F1-score. Can you help me understand this?

**tpot classifier**

```python
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=5, scoring='f1', verbosity=2,
                      template='Selector-Classifier', random_state=42)
tpot.fit(X_train_std, y_train)
print(tpot.score(X_test_std, y_test))
tpot.export('tpot_digits_pipeline.py')
tpot.fitted_pipeline_
```

**exported pipeline**

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

features = X
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, y, random_state=42)

exported_pipeline = make_pipeline(
    RFE(estimator=ExtraTreesClassifier(criterion="gini", max_features=0.55, n_estimators=100), step=0.6000000000000001),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.45, min_samples_leaf=1, min_samples_split=9, n_estimators=100)
)
# Fix random state for all the steps in the exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
y_pred = exported_pipeline.predict(testing_features)
```

**Feature output**

```python
exported_pipeline.named_steps['rfe'].get_support()
```
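A small hypothetical follow-up to map the boolean support mask back to feature names (this assumes `X` is a pandas DataFrame; `X.columns` is illustrative):

```python
# Hypothetical follow-up: map the boolean support mask back to column names.
# Assumes X is a pandas DataFrame; use integer indices if X is a numpy array.
mask = exported_pipeline.named_steps['rfe'].get_support()
selected_features = [col for col, keep in zip(X.columns, mask) if keep]
print(selected_features)
```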



![image](https://user-images.githubusercontent.com/30723825/71759629-ec4e4980-2eea-11ea-890e-0c6df349004f.png)
weixuanfu commented 4 years ago

> Through the fitted_pipeline_ attribute, I can only see the pipeline and its attributes. Am I right?

Yes, it is a scikit-learn Pipeline object.

> Later, I have to execute the pipeline and use get_support to get the list of features. Am I right?

Yes, the pipeline is interpretable with the Pipeline API.
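For example, a quick sketch of walking the fitted pipeline with the standard scikit-learn Pipeline API:

```python
# Walk the steps of the fitted scikit-learn Pipeline.
for name, step in tpot.fitted_pipeline_.steps:
    print(name, step)
```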

> 1. So in this case, it picked `RFE` (refer below) as the feature selection approach after trying several other feature selection approaches?
>
> 2. Later, once RFE was chosen, did it try combinations of different parameters and finally arrive at these parameters? (refer below)
>
> 3. I see that `evaluated_individuals_` only has `RFE`, `SelectPercentile`, and `VarianceThreshold`, which keep repeating. Will `tpot` not try other methods like `LASSO`?
>
> 4. So am I right to understand that `tpot` tried multiple values for `max_features` and chose the best `n` features? It wasn't using the default value for max_features. Am I right?
>
> 5. I might be wrong here, but is it possible for `tpot` to miss some combinations of hyperparameters? For example, it produces an `F1-score of 79%`, but in my previous manual experiments I changed a few values and was able to get an `F1-score of 81%`. I am just trying to understand why `tpot` wasn't able to find the best `F1-score`. Can you help me understand this?

Answers to the 5 questions above: TPOT uses genetic programming (GP) to optimize pipelines, which means that TPOT randomly generates solutions in the initial population (generation 0) and then evolves them via crossover and mutation over generations. TPOT may not try all the estimators unless the number of generations × population size (or offspring size) is large enough. TPOT should also try combinations of hyperparameters of those scikit-learn-API estimators via GP. Changing the hyperparameters/values of TPOTClassifier may change the best results here, since the optimization process via GP is changed.
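As an illustrative sketch (not a guaranteed fix), increasing the search budget gives GP more opportunities to reach the hyperparameter combinations you found manually; this reuses the `X_train_std`/`y_train` data from the snippet above:

```python
# Illustrative only: a larger generations/population_size budget lets GP
# evaluate far more selector, classifier, and hyperparameter combinations.
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=20, population_size=50, scoring='f1',
                      template='Selector-Classifier',
                      random_state=42, verbosity=2)
tpot.fit(X_train_std, y_train)
```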

Could you share the hyperparameters from your manual experiments here?