bmurauer / pipelinehelper

scikit-helper to hot-swap pipeline elements
GNU General Public License v3.0

grid.best_estimator_.get_params() vague selected_model output #12

Open browshanravan opened 4 years ago

browshanravan commented 4 years ago

In my example code, GaussianNB() was selected as the best estimator, but the selected_model output from grid.best_estimator_.get_params() does not seem to reflect this, even though I instantiated it as GaussianNB in the PipelineHelper. The selected_model does, however, show the parameters for GaussianNB(), such as priors and var_smoothing. The available_models entries in the output of grid.get_params().keys() look fine, though.

I suspect this has something to do with the fact that I left the default parameters for GaussianNB() as they are and did not put anything for it in the grid search.
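
For context, a minimal sketch of the kind of setup described above (my full example code is not shown here; the step name clf matches the parameter keys below, but everything else, including the grid values, is a placeholder):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from pipelinehelper import PipelineHelper

pipe = Pipeline([
    # ('preprosessor', ...),  # ColumnTransformer step from the output below, omitted here
    ('clf', PipelineHelper([
        ('GaussianNB', GaussianNB()),
        ('RandomForestClassifier', RandomForestClassifier()),
        ('ExtraTreesClassifier', ExtraTreesClassifier()),
    ])),
])

params = {
    # GaussianNB is left at its defaults, so no parameters are listed for it
    'clf__selected_model': pipe.named_steps['clf'].generate({
        'RandomForestClassifier__n_estimators': [100, 200],
        'ExtraTreesClassifier__n_estimators': [100, 200],
    }),
}

grid = GridSearchCV(pipe, params, scoring='accuracy', n_jobs=-1)
# grid.fit(X, y)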

Here is the grid.best_estimator_.get_params() output (keys only):

['clf',
 'clf__available_models',
 'clf__available_models__ExtraTreesClassifier',
 'clf__available_models__ExtraTreesClassifier__bootstrap',
 'clf__available_models__ExtraTreesClassifier__ccp_alpha',
 'clf__available_models__ExtraTreesClassifier__class_weight',
 'clf__available_models__ExtraTreesClassifier__criterion',
 'clf__available_models__ExtraTreesClassifier__max_depth',
 'clf__available_models__ExtraTreesClassifier__max_features',
 'clf__available_models__ExtraTreesClassifier__max_leaf_nodes',
 'clf__available_models__ExtraTreesClassifier__max_samples',
 'clf__available_models__ExtraTreesClassifier__min_impurity_decrease',
 'clf__available_models__ExtraTreesClassifier__min_impurity_split',
 'clf__available_models__ExtraTreesClassifier__min_samples_leaf',
 'clf__available_models__ExtraTreesClassifier__min_samples_split',
 'clf__available_models__ExtraTreesClassifier__min_weight_fraction_leaf',
 'clf__available_models__ExtraTreesClassifier__n_estimators',
 'clf__available_models__ExtraTreesClassifier__n_jobs',
 'clf__available_models__ExtraTreesClassifier__oob_score',
 'clf__available_models__ExtraTreesClassifier__random_state',
 'clf__available_models__ExtraTreesClassifier__verbose',
 'clf__available_models__ExtraTreesClassifier__warm_start',
 'clf__available_models__GaussianNB',
 'clf__available_models__GaussianNB__priors',
 'clf__available_models__GaussianNB__var_smoothing',
 'clf__available_models__RandomForestClassifier',
 'clf__available_models__RandomForestClassifier__bootstrap',
 'clf__available_models__RandomForestClassifier__ccp_alpha',
 'clf__available_models__RandomForestClassifier__class_weight',
 'clf__available_models__RandomForestClassifier__criterion',
 'clf__available_models__RandomForestClassifier__max_depth',
 'clf__available_models__RandomForestClassifier__max_features',
 'clf__available_models__RandomForestClassifier__max_leaf_nodes',
 'clf__available_models__RandomForestClassifier__max_samples',
 'clf__available_models__RandomForestClassifier__min_impurity_decrease',
 'clf__available_models__RandomForestClassifier__min_impurity_split',
 'clf__available_models__RandomForestClassifier__min_samples_leaf',
 'clf__available_models__RandomForestClassifier__min_samples_split',
 'clf__available_models__RandomForestClassifier__min_weight_fraction_leaf',
 'clf__available_models__RandomForestClassifier__n_estimators',
 'clf__available_models__RandomForestClassifier__n_jobs',
 'clf__available_models__RandomForestClassifier__oob_score',
 'clf__available_models__RandomForestClassifier__random_state',
 'clf__available_models__RandomForestClassifier__verbose',
 'clf__available_models__RandomForestClassifier__warm_start',
 'clf__optional',
 'clf__selected_model',
 'clf__selected_model__priors',
 'clf__selected_model__var_smoothing',
 'memory',
 'preprosessor',
 'preprosessor__C_Fimp',
 'preprosessor__C_Fimp__cat_imputer',
 'preprosessor__C_Fimp__cat_imputer__add_indicator',
 'preprosessor__C_Fimp__cat_imputer__copy',
 'preprosessor__C_Fimp__cat_imputer__fill_value',
 'preprosessor__C_Fimp__cat_imputer__missing_values',
 'preprosessor__C_Fimp__cat_imputer__strategy',
 'preprosessor__C_Fimp__cat_imputer__verbose',
 'preprosessor__C_Fimp__memory',
 'preprosessor__C_Fimp__onehot',
 'preprosessor__C_Fimp__onehot__categories',
 'preprosessor__C_Fimp__onehot__drop',
 'preprosessor__C_Fimp__onehot__dtype',
 'preprosessor__C_Fimp__onehot__handle_unknown',
 'preprosessor__C_Fimp__onehot__sparse',
 'preprosessor__C_Fimp__steps',
 'preprosessor__C_Fimp__verbose',
 'preprosessor__N_Fimp',
 'preprosessor__N_Fimp__memory',
 'preprosessor__N_Fimp__num_imputer',
 'preprosessor__N_Fimp__num_imputer__add_indicator',
 'preprosessor__N_Fimp__num_imputer__copy',
 'preprosessor__N_Fimp__num_imputer__fill_value',
 'preprosessor__N_Fimp__num_imputer__missing_values',
 'preprosessor__N_Fimp__num_imputer__strategy',
 'preprosessor__N_Fimp__num_imputer__verbose',
 'preprosessor__N_Fimp__steps',
 'preprosessor__N_Fimp__verbose',
 'preprosessor__n_jobs',
 'preprosessor__remainder',
 'preprosessor__sparse_threshold',
 'preprosessor__transformer_weights',
 'preprosessor__transformers',
 'preprosessor__verbose',
 'steps',
 'verbose']
bmurauer commented 4 years ago

I'm afraid I don't understand your question:

In my example code, GaussianNB() was selected as the best estimator, but the selected_model output from grid.best_estimator_.get_params() does not seem to reflect this

In the above output, the lines

 'clf__selected_model',
 'clf__selected_model__priors',
 'clf__selected_model__var_smoothing',

suggest that the GaussianNB model was selected as the best estimator, as you describe. What am I missing?

browshanravan commented 4 years ago

Shouldn't it be written as 'clf__selected_model__GaussianNB__priors' instead of 'clf__selected_model__priors'? It is not easy to tell that the selected model was GaussianNB just by looking at the parameters listed under clf__selected_model. It is not very explicit, given that I specifically defined ("GaussianNB", GaussianNB()) in the PipelineHelper in my example code.

This becomes especially problematic if you have both RandomForestClassifier and ExtraTreesClassifier in your PipelineHelper: the two share almost identical parameters, so you have to figure out which one was chosen as the selected_model when calling grid.best_estimator_.get_params().
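
As a stopgap, one can read the selected estimator object itself rather than the flattened key names (a sketch, assuming clf__selected_model in get_params() holds the chosen estimator instance, as the nested priors/var_smoothing keys above suggest):

# The class name of the selected estimator identifies the winning model directly
best_clf = grid.best_estimator_.get_params()['clf__selected_model']
print(type(best_clf).__name__)  # e.g. 'GaussianNB', so RandomForest vs ExtraTrees is unambiguous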

bmurauer commented 4 years ago

Ah OK, I now see what you mean. I agree that this would be helpful, but I'll have to think about the internal changes that this fix would imply.

browshanravan commented 4 years ago

If this is not a trivial matter, then that is fine. A user can always call grid.best_params_ and see which parameters were chosen. I just thought it would be nice to have it reflected in the grid.best_estimator_.get_params() output as well.
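
For example (a sketch; the exact value shown is illustrative, but since the grid candidates produced by generate() are (name, params) tuples, the winning model's name should appear directly):

print(grid.best_params_)
# e.g. {'clf__selected_model': ('GaussianNB', {})}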

kaelgabriel commented 4 years ago


I like to play with something like this, especially when one is using two scoring functions:

import pandas as pd
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=0, n_jobs=-1)
grid.fit(X, y)
# Collect the CV results, index by parameter combination, keep timing/score columns
df_grid_search = pd.DataFrame(grid.cv_results_)
df_grid_search = df_grid_search.set_index('params')[
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',
     'std_test_score', 'rank_test_score']]
df_grid_search.sort_values(by='rank_test_score').head(10)

Or, with a bit more code noise:

# Same idea, but stringify the parameter tuples for a more compact index and keep
# every mean_test_* / rank_test_* column (useful with more than one scorer)
grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=0, n_jobs=-1)
grid.fit(X, y)
df_grid_search = pd.DataFrame(grid.cv_results_)
df_grid_search['params'] = [str(list(x.values())).replace('(', '').replace(')', '')
                            for x in df_grid_search['params']]
score_cols = [c for c in df_grid_search.columns
              if ('rank_test' in c) or ('mean_test' in c)]
df_grid_search = df_grid_search.set_index('params')[['mean_fit_time', 'mean_score_time'] + score_cols]
df_grid_search.sort_values(by=[c for c in score_cols if 'rank_test' in c]).head(10)