Selva-sys opened this issue 4 years ago

Hello,

I recently started using `tpot` for data analysis and am currently in a situation where I have to select the best features (like best parameters) that lead to a high performance metric; in my case, it's `f1-score`.

**Context of the question**

I ran the `tpot` classifier on my dataset to select the best-performing model/pipeline. Now what I am looking for is the best feature subset. I already referred to issue #710, and it doesn't help because I am looking for the subset of all of my features that will lead to high accuracy.

I was assuming that the best feature subset is the direct result of optimization (the purpose of genetic programming) and that feature importance is the derived result. Am I right?

I did see that `tpot` has `FeatureSetSelector`, but why are we expected to key in the `feature subset size` as shown below? Can anyone help me understand what this code does? Based on my knowledge, we provide a CSV file which has info on feature subset sizes and feature names. Out of the 4 feature subsets present in the CSV, we select one subset. Am I right? But may I know why we have to provide this? How can we do this like a genetic-algorithm optimization approach, which finds the best feature set by trying multiple combinations?

Can someone help me with this on a simple example like the Boston or Iris dataset please?
`FeatureSetSelector` is a special new operator in TPOT. This operator enables feature selection based on prior expert knowledge, so those feature sets are not obtained via genetic programming in TPOT. We added this operator for analyzing big genomic data with prior expert knowledge of biological pathways. Please check this link for more details.
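For illustration, here is a minimal sketch of using this operator on its own (the `subsets.csv` contents and gene names are hypothetical; `subset_list` and `sel_subset` are the operator's two parameters):

```python
import pandas as pd
from tpot.builtins import FeatureSetSelector

# Hypothetical data: a DataFrame whose columns match the names in the subset file.
X = pd.DataFrame({"geneA": [0, 1], "geneB": [1, 0], "geneC": [1, 1],
                  "geneD": [0, 0], "geneE": [1, 1]})

# subsets.csv (hypothetical) defines each expert-knowledge feature set up front:
#   Subset,Size,Features
#   pathway_1,3,geneA;geneB;geneC
#   pathway_2,2,geneD;geneE
fss = FeatureSetSelector(subset_list="subsets.csv", sel_subset=0)  # 0 = first subset
X_subset = fss.fit_transform(X)  # keeps only geneA, geneB, geneC
```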
Sure, got it. Thanks for the prompt response. Am I then right to understand that TPOT currently doesn't support best-feature-subset selection? I mean, getting the best subset of features based on optimization.
I think there is one hacky way to do feature selection based on optimization. You could try TPOT with `template="SelectFromModel-Classifier"` or `template="Selector-Classifier"` and then check which features are selected in the first step of the `fitted_pipeline_` attribute of the fitted `TPOTClassifier` object.
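A minimal sketch of that workflow (the dataset and search budget here are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Constrain every candidate pipeline to a feature selector followed by a classifier.
tpot = TPOTClassifier(generations=5, population_size=20, scoring='f1',
                      template='Selector-Classifier', random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

# The first step of the winning pipeline is the selector; its boolean mask
# shows which input features were kept.
selector = tpot.fitted_pipeline_.steps[0][1]
print(selector.get_support())
```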
Hi,

Through the `fitted_pipeline_` attribute, I can only see the pipeline and its attributes. Am I right? Later, I have to execute the pipeline and use `get_support` to get the list of features. Am I right?
1) So in this case, it picked `RFE` (refer below) as the feature-selection approach after trying several other feature-selection approaches?
2) Once `RFE` was chosen, did it try combinations of different parameters and finally arrive at these parameters? (refer below)
3) I see that `evaluated_individuals_` only has `RFE`, `SelectPercentile`, and `VarianceThreshold`, which keep repeating. Will `tpot` not try other methods like LASSO?
4) So am I right to understand that `tpot` tried multiple values for `max_features` and chose the best `n` features? It wasn't using the default value for `max_features`. Am I right?
5) I might be wrong here. Is it possible for `tpot` to miss some combinations of hyperparameters? For example, it produces an F1-score of 79%, but based on my previous manual experiments, I changed a few values and was able to get an F1-score of 81%. I am just trying to understand why `tpot` wasn't able to provide the best F1-score. Can you help me understand this?
**TPOT classifier**
```python
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=5, scoring='f1', verbosity=2,
                      template='Selector-Classifier', random_state=42)
tpot.fit(X_train_std, y_train)
print(tpot.score(X_test_std, y_test))
tpot.export('tpot_digits_pipeline.py')
tpot.fitted_pipeline_
```
**Exported pipeline**

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

features = X
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, y, random_state=42)

exported_pipeline = make_pipeline(
    RFE(estimator=ExtraTreesClassifier(criterion="gini", max_features=0.55, n_estimators=100), step=0.6000000000000001),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.45, min_samples_leaf=1, min_samples_split=9, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
y_pred = exported_pipeline.predict(testing_features)
```
**Feature output**
```python
exported_pipeline.named_steps['rfe'].get_support()
```
![image](https://user-images.githubusercontent.com/30723825/71759629-ec4e4980-2eea-11ea-890e-0c6df349004f.png)
> Through the `fitted_pipeline_` attribute, I can only see the pipeline and its attributes. Am I right?
Yes, it is a scikit-learn Pipeline object.
> Later, I have to execute the pipeline and use `get_support` to get the list of features. Am I right?
Yes, the pipeline is interpretable with the scikit-learn Pipeline API.
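For example (the step name `'rfe'` matches the exported pipeline above; `feature_names` stands in for your own column names):

```python
# Walk the fitted Pipeline step by step.
for name, step in tpot.fitted_pipeline_.steps:
    print(name, step)

# Selector steps expose scikit-learn's selector API, so get_support()
# returns a boolean mask over the input features.
mask = tpot.fitted_pipeline_.named_steps['rfe'].get_support()
selected = [f for f, keep in zip(feature_names, mask) if keep]
print(selected)
```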
Answers to the 5 questions above: TPOT uses genetic programming (GP) to optimize pipelines, which means that TPOT randomly generates solutions in the initial population (generation 0) and then evolves them via crossover and mutation over generations. TPOT may not try all the estimators unless the number of generations × population size (or offspring size) is large enough. TPOT should also try combinations of hyperparameters of those scikit-learn API estimators via GP. Changing the hyperparameters/values of `TPOTClassifier` may change the best results here, since it changes the optimization process via GP.
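For example (a sketch with an arbitrary larger budget; `offspring_size` defaults to the population size): after generation 0, roughly `generations × offspring_size` additional pipelines are evaluated, so raising those numbers lets GP reach more selectors and hyperparameter settings:

```python
from tpot import TPOTClassifier

# 20 generations x 50 offspring = ~1000 evaluated pipelines after
# generation 0, versus 5 x 5 = 25 in the run above.
tpot = TPOTClassifier(generations=20, population_size=50, offspring_size=50,
                      scoring='f1', template='Selector-Classifier',
                      random_state=42, verbosity=2)
tpot.fit(X_train_std, y_train)
```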
Could you share the hyperparameters in your case here?