Closed: lifeodyssey closed this issue 4 years ago
Auto-sklearn by default fits an ensemble, and that's what is printed if you use `show_models`. If you want the single best model, you can get the index of the best model with `np.argmax(automl.cv_results_['mean_test_score'])` and then access the hyperparameters with `automl.cv_results_['params']`. See here for the documentation of `cv_results_`. Alternatively, if you only want a single model, you can pass the argument `ensemble_size=1` to the `AutoSklearnClassifier` and it will create an ensemble of size 1, i.e. the single best model.
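A minimal sketch of the `np.argmax` lookup described above. The helper name `best_single_run` and the toy dictionary are hypothetical stand-ins for a fitted estimator's `cv_results_`; the parameter keys inside `params` are illustrative only. The status filter is an added precaution so that an unfinished run can never be picked.

```python
import numpy as np


def best_single_run(cv_results):
    """Return (index, params) of the best successful run in a cv_results_-like dict."""
    scores = np.asarray(cv_results["mean_test_score"], dtype=float)
    succeeded = np.asarray(cv_results["status"]) == "Success"
    # Mask out runs that did not finish so they cannot win the argmax.
    masked = np.where(succeeded, scores, -np.inf)
    best = int(np.argmax(masked))
    return best, cv_results["params"][best]


# Hypothetical stand-in for automl.cv_results_ (values are made up for illustration).
toy_cv_results = {
    "mean_test_score": [0.8941, 0.8984, 0.0, 0.8970],
    "status": ["Success", "Success", "Timeout", "Success"],
    "params": [
        {"classifier:__choice__": "extra_trees", "balancing:strategy": "none"},
        {"classifier:__choice__": "extra_trees", "balancing:strategy": "weighting"},
        {"classifier:__choice__": "random_forest", "balancing:strategy": "none"},
        {"classifier:__choice__": "adaboost", "balancing:strategy": "none"},
    ],
}

best_index, best_params = best_single_run(toy_cv_results)
print(best_index, best_params)
```

With a fitted estimator, the same helper would be called as `best_single_run(automl.cv_results_)`. Note that this returns the best single run, which is not necessarily a member of the ensemble printed by `show_models()`.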
Okay, here we go: https://automl.github.io/auto-sklearn/development/examples/40_advanced/example_get_pipeline_components.html#detailed-statistics-about-the-search-part-1
You can find code to get the best model at the bottom of the linked section. The example is so far only available in the documentation of the development branch, but also works for the master branch.
@mfeurer I set `ensemble_size=10` and ran auto-sklearn for 1 hour. `model.cv_results_` contains 243 rows: 240 with status Success and 3 with status Timeout. But when I print `model.show_models()`, it shows only 3 models. How can I tell which rows of `cv_results_` those 3 models correspond to? I tried sorting `cv_results_` by the `mean_test_score` key, but the top 3 configurations do not match the `show_models()` output.
@mfeurer Top 4 configs when sorting `cv_results_` by `mean_test_score`:

mean_test_score | mean_fit_time | param_classifier:choice | rank_test_scores | status | budgets | param_balancing:strategy |
---|---|---|---|---|---|---|
0.898388227049755 | 3.68728971481323 | extra_trees | 1 | Success | 0 | weighting |
0.896986685353889 | 6.91696286201477 | adaboost | 2 | Success | 0 | none |
0.894183601962158 | 3.38019490242004 | extra_trees | 3 | Success | 0 | none |
0.893482831114226 | 3.80625748634338 | extra_trees | 4 | Success | 0 | weighting |

But `show_models()` contains no `adaboost`. And if I set `ensemble_size=10`, why do I get only 3 models from `show_models()`?
My impression is that `ensemble_size=10` specifies the maximum number of different models to include in an ensemble. However, more models do not necessarily mean better results. Thus, even if you allow up to 10 models, the best ensemble can be composed of just 3 models.
The same reasoning seems to explain the difference between the models that compose the best ensemble and the best-ranked models in `cv_results_`. Unfortunately, not much information is reported about what `cv_results_` represents. My guess is that `cv_results_` reports only single models. Therefore, the best ensemble does not necessarily include the top-k best single models.
We use ensemble selection (https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf) with repetitions, meaning the same model can be picked more than once, to build ensembles. Thus, the number of different models that end up in the ensemble is usually lower than `ensemble_size`. Have a look at the newly introduced leaderboard to get some further information on what's in the ensemble and what was searched.
`cv_results_` contains single runs.
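To illustrate why selection with repetitions tends to yield fewer distinct models than `ensemble_size`, here is a rough sketch of greedy ensemble selection on synthetic data. It is not auto-sklearn's implementation: the function name, the use of accuracy as the metric, the first-wins tie-breaking, and the toy data are all assumptions made for the example.

```python
import numpy as np


def greedy_ensemble_selection(probas, y_true, ensemble_size):
    """Greedily pick `ensemble_size` models, allowing the same model repeatedly.

    probas has shape (n_models, n_samples, n_classes).
    Returns the list of picked model indices (length == ensemble_size).
    """
    n_models = probas.shape[0]
    chosen = []
    running_sum = np.zeros(probas.shape[1:], dtype=float)
    for _ in range(ensemble_size):
        best_idx, best_acc = -1, -1.0
        for m in range(n_models):
            # Accuracy of the averaged prediction if model m were added next.
            avg = (running_sum + probas[m]) / (len(chosen) + 1)
            acc = float(np.mean(avg.argmax(axis=1) == y_true))
            if acc > best_acc:  # ties keep the earlier model (a simplification)
                best_idx, best_acc = m, acc
        chosen.append(best_idx)
        running_sum += probas[best_idx]
    return chosen


# Synthetic stand-in for validation predictions of 30 candidate models.
rng = np.random.default_rng(0)
n_models, n_samples = 30, 200
y_true = rng.integers(0, 2, size=n_samples)
p_class1 = np.clip(y_true + rng.normal(0.0, 0.6, size=(n_models, n_samples)), 0.0, 1.0)
probas = np.stack([1.0 - p_class1, p_class1], axis=-1)

chosen = greedy_ensemble_selection(probas, y_true, ensemble_size=10)
members, counts = np.unique(chosen, return_counts=True)
weights = counts / len(chosen)  # a model picked k times gets weight k / ensemble_size
print("picks:", chosen)
print("distinct members:", len(members))
print("weights:", dict(zip(members.tolist(), weights.tolist())))
```

In this scheme `ensemble_size` is the number of picks, not the number of distinct models. A model picked several times simply receives a larger weight, so the printed ensemble can hold far fewer than `ensemble_size` members, and those members need not be the top-ranked single runs.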
If you have further questions, please open a new issue.
Hi, I have trained an auto-sklearn classifier on my own data. I want to extract the parameters of its best model so that I don't have to train again when I have new data to classify. I have tried `cv_results_` and `show_models`, but they return many models and their parameters. I can't tell which one auto-sklearn selected as the best model, or what its parameters are. Could you please help me with this problem?