automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

How to get the parameter and result of the best model #872

Closed lifeodyssey closed 4 years ago

lifeodyssey commented 4 years ago

Hi, I have trained an auto-sklearn classifier on my own data. I want to extract the parameters of its best model so that I don't have to train again when I have new data to classify. I have tried cv_results_ and show_models(), but they gave me a lot of models and their parameters. I can't find which model auto-sklearn selected as the best, or its parameters. Could you please help me with this problem?

mfeurer commented 4 years ago

Auto-sklearn fits an ensemble by default, and that is what show_models prints. If you want the single best model, you can get its index with np.argmax(automl.cv_results_['mean_test_score']) and then look up its hyperparameters at that index in automl.cv_results_['params']. See here for documentation of cv_results_. Alternatively, if you only want a single model, you can pass ensemble_size=1 to the AutoSklearnClassifier; it will then create an ensemble of size 1, i.e. the single best model.
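
A minimal sketch of the lookup described above, written against a plain dict so that it runs without auto-sklearn installed. The dict only imitates the shape of automl.cv_results_; its values, the "classifier:__choice__" and "balancing:strategy" parameter names, and the helper best_single_run are assumptions for illustration. Restricting the search to runs whose status is "Success" is an extra precaution, not something the answer above requires.

```python
# Illustrative stand-in for automl.cv_results_ (assumption: a fitted
# AutoSklearnClassifier exposes lists under these keys). Values are made up.
cv_results = {
    "mean_test_score": [0.8942, 0.0, 0.8984, 0.8970],
    "status": ["Success", "Timeout", "Success", "Success"],
    "params": [
        {"classifier:__choice__": "extra_trees", "balancing:strategy": "none"},
        {"classifier:__choice__": "random_forest", "balancing:strategy": "none"},
        {"classifier:__choice__": "extra_trees", "balancing:strategy": "weighting"},
        {"classifier:__choice__": "adaboost", "balancing:strategy": "none"},
    ],
}


def best_single_run(results):
    """Return (index, score, params) of the best-scoring successful run.

    Hypothetical helper; equivalent to np.argmax on 'mean_test_score',
    but restricted to rows whose status is 'Success'.
    """
    candidates = [i for i, s in enumerate(results["status"]) if s == "Success"]
    if not candidates:
        raise ValueError("no successful runs to choose from")
    best = max(candidates, key=lambda i: results["mean_test_score"][i])
    return best, results["mean_test_score"][best], results["params"][best]


index, score, params = best_single_run(cv_results)
print(index, score, params["classifier:__choice__"])  # 2 0.8984 extra_trees
```

With a real fitted classifier, the same function would presumably be called as best_single_run(automl.cv_results_).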

mfeurer commented 4 years ago

Okay, here we go: https://automl.github.io/auto-sklearn/development/examples/40_advanced/example_get_pipeline_components.html#detailed-statistics-about-the-search-part-1

You can find code to get the best model at the bottom of the linked section. The example is so far only available in the documentation of the development branch, but also works for the master branch.

shabir1 commented 3 years ago

@mfeurer I set ensemble_size=10 and ran auto-sklearn for 1 hour. model.cv_results_ contains 243 rows: 240 show a Success status and 3 show Timeout. But when I print model.show_models(), it prints only 3 models. How can I tell which rows of cv_results_ correspond to those 3 models? I tried sorting cv_results_ by the mean_test_score key, but the top 3 configs do not match the show_models() results.

shabir1 commented 3 years ago

@mfeurer Top 4 configs when sorting cv_results_ by mean_test_score:

mean_test_score    mean_fit_time     param_classifier:__choice__  rank_test_scores  status   budgets  param_balancing:strategy
0.898388227049755  3.68728971481323  extra_trees                  1                 Success  0        weighting
0.896986685353889  6.91696286201477  adaboost                     2                 Success  0        none
0.894183601962158  3.38019490242004  extra_trees                  3                 Success  0        none
0.893482831114226  3.80625748634338  extra_trees                  4                 Success  0        weighting

But show_models() contains no adaboost model. And if I set ensemble_size=10, why do I get only 3 models from show_models()?

Huertas97 commented 2 years ago

My impression is that ensemble_size=10 specifies the maximum number of different models to include in the ensemble. However, more models do not necessarily mean better results. Thus, even if you allow up to 10 models, the best ensemble may consist of just 3.

The same reasoning seems to explain the difference between the models that make up the best ensemble and the top-ranked models in cv_results_. Unfortunately, little is documented about what cv_results_ represents. My guess is that cv_results_ reports only single models. Therefore, the best ensemble does not necessarily include the top-k single models.

mfeurer commented 2 years ago

We use ensemble selection (https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf) with repetitions to build ensembles. Thus, the number of different models that end up in the ensemble is usually lower than ensemble_size. Have a look at the newly introduced leaderboard for further information on what is in the ensemble and what was searched.

cv_results_ contains single runs.
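
To illustrate why selection with repetitions can yield fewer distinct models than ensemble_size, and why the result need not match the top-ranked single runs, here is a simplified sketch of greedy ensemble selection with replacement. It is not auto-sklearn's implementation: the four models, their validation probabilities, the accuracy metric, and the tie-breaking rule are all assumptions made up for this example.

```python
from collections import Counter

# Made-up validation labels and class-1 probabilities for four hypothetical models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "A": [0.9, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1],  # 7/8 correct on its own
    "B": [0.8, 0.8, 0.8, 0.3, 0.2, 0.2, 0.2, 0.2],  # 7/8, same mistake as A
    "C": [0.4, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.6],  # 6/8, but complements A
    "D": [0.3, 0.4, 0.6, 0.2, 0.6, 0.7, 0.4, 0.6],  # 2/8
}


def accuracy(members, predictions, labels):
    """Accuracy of the members' averaged probabilities, thresholded at 0.5."""
    correct = 0
    for i, label in enumerate(labels):
        mean_prob = sum(predictions[name][i] for name in members) / len(members)
        correct += int((mean_prob > 0.5) == bool(label))
    return correct / len(labels)


def greedy_ensemble(predictions, labels, ensemble_size):
    """Fill `ensemble_size` slots one at a time; a model may be picked again."""
    chosen = []
    for _ in range(ensemble_size):
        # max() keeps the first candidate on ties (an arbitrary choice here).
        best = max(predictions, key=lambda name: accuracy(chosen + [name], predictions, labels))
        chosen.append(best)
    return chosen


chosen = greedy_ensemble(predictions, labels, ensemble_size=4)
weights = {name: count / len(chosen) for name, count in Counter(chosen).items()}
print(chosen)   # ['A', 'C', 'C', 'A']
print(weights)  # {'A': 0.5, 'C': 0.5}
```

In this toy run, four slots are filled by only two distinct models, and B, the second-best single model, is left out in favour of the weaker but complementary C.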

If you have further questions, please open a new issue.