automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

how to correctly set ensemble_size=0 #637

Closed smilesun closed 5 years ago

smilesun commented 5 years ago

Continuing from #451: time_left_for_this_task is not a sensible budget in our application scenario, because hardware and workload differ between machines, so we decided to use runcount_limit instead, following the example below:


import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification


def main():
    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=50000,
        per_run_time_limit=30,
        #tmp_folder='/tmp/autosklearn_cv_example_tmp',
        #output_folder='/tmp/autosklearn_cv_example_out',
        #delete_tmp_folder_after_terminate=False,
        resampling_strategy='cv',
        initial_configurations_via_metalearning=0,
        ensemble_size=0,
        smac_scenario_args={'runcount_limit': 5},
        resampling_strategy_arguments={'folds': 5}
    )

    # fit() changes the data in place, but refit needs the original data. We
    # therefore copy the data. In practice, one should reload the data
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='digits')
    # During fit(), models are fit on individual cross-validation folds. To use
    # all available data, we call refit() which trains all models in the
    # final ensemble on the whole dataset.
    automl.refit(X_train.copy(), y_train.copy())

    print(automl.show_models())

    predictions = automl.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))


if __name__ == '__main__':
    main()

We get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-263240bbee7e> in <module>()
----> 1 main()

<ipython-input-7-a2ebfbedfe19> in main()
     20     # all available data, we call refit() which trains all models in the
     21     # final ensemble on the whole dataset.
---> 22     automl.refit(X_train.copy(), y_train.copy())
     23     print(automl.show_models())
     24     predictions = automl.predict(X_test)

~/anaconda3/lib/python3.6/site-packages/autosklearn/estimators.py in refit(self, X, y)
    495
    496         """
--> 497         self._automl[0].refit(X, y)
    498         return self
    499

~/anaconda3/lib/python3.6/site-packages/autosklearn/automl.py in refit(self, X, y)
    907                              (self._n_outputs, _n_outputs))
    908
--> 909         return super().refit(X, y)
    910
    911     def fit_ensemble(self, y, task=None, metric=None, precision='32',

~/anaconda3/lib/python3.6/site-packages/autosklearn/automl.py in refit(self, X, y)
    522         # Refit is not applicable when ensemble_size is set to zero.
    523         if self.ensemble_ is None:
--> 524             raise ValueError("Refit can only be called if 'ensemble_size != 0'")
    525
    526         random_state = np.random.RandomState(self._seed)

ValueError: Refit can only be called if 'ensemble_size != 0'

smilesun commented 5 years ago

If I change refit() to fit_ensemble(), I get the following error:

[WARNING] [2019-02-19 17:57:37,088:EnsembleBuilder(1):digits] Error loading /tmp/autosklearn_tmp_8001_9848/.auto-sklearn/predictions_ensemble/predictions_ensemble_1_5.npy: Traceback (most recent call last):
  File "/home/sunxd/anaconda3/lib/python3.6/site-packages/autosklearn/ensemble_builder.py", line 321, in read_ensemble_preds
    all_scoring_functions=False)
  File "/home/sunxd/anaconda3/lib/python3.6/site-packages/autosklearn/metrics/__init__.py", line 262, in calculate_score
    if task_type not in TASK_TYPES:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
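
One possible reading of this warning, offered as an assumption rather than a confirmed diagnosis: the signature visible in the first traceback is fit_ensemble(self, y, task=None, metric=None, ...), so calling it with the same positional arguments as refit(X, y) would bind the label array to task. A membership test such as `task_type not in TASK_TYPES` then compares an array against each list element, and NumPy refuses to reduce the element-wise result to a single boolean. The sketch below reproduces only that NumPy behaviour; it requires NumPy, and TASK_TYPES here is a hypothetical stand-in list of integers, not the library's actual constant.

```python
import numpy as np

# Hypothetical stand-in for a list of integer task identifiers.
TASK_TYPES = [1, 2, 3]

# A label array accidentally passed where a single task identifier is expected.
task_type = np.array([0, 1, 2, 1])

try:
    # list.__contains__ evaluates `element == task_type`, which yields an
    # element-wise boolean array; converting that array to bool raises.
    task_type not in TASK_TYPES
    message = None
except ValueError as err:
    message = str(err)

print(message)

# With a scalar task identifier the same check works as intended.
scalar_is_unknown = 2 not in TASK_TYPES
print(scalar_is_unknown)
```

If this reading applies, passing the labels alone, as in fit_ensemble(y_train.copy(), ensemble_size=1), avoids the problem.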

smilesun commented 5 years ago

@mfeurer, is it OK to first call fit_ensemble() with ensemble_size=1 and then call refit(), as follows? It works, at least.

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification


def main():
    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=50000,
        per_run_time_limit=30,
        #tmp_folder='/tmp/autosklearn_cv_example_tmp',
        #output_folder='/tmp/autosklearn_cv_example_out',
        #delete_tmp_folder_after_terminate=False,
        resampling_strategy='cv',
        initial_configurations_via_metalearning=0,
        ensemble_size=0,
        smac_scenario_args={'runcount_limit': 5},
        resampling_strategy_arguments={'folds': 5}
    )

    # fit() changes the data in place, but refit needs the original data. We
    # therefore copy the data. In practice, one should reload the data
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='digits')
    # fit() ran with ensemble_size=0, so no ensemble exists yet. Build one of
    # size 1 from the models evaluated during fit().
    automl.fit_ensemble(y_train.copy(), ensemble_size=1)
    # During fit(), models are fit on individual cross-validation folds. To use
    # all available data, we call refit() which trains all models in the
    # final ensemble on the whole dataset.
    automl.refit(X_train.copy(), y_train.copy())

    print(automl.show_models())

    predictions = automl.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))


if __name__ == '__main__':
    main()

mfeurer commented 5 years ago

Sorry for the slow response.

Yes, your latest example is the correct way to go. During fit() you don't build an ensemble, so refit() has nothing to be applied to. By calling fit_ensemble() you build an ensemble, which can then be refit on the full training data (including the validation set that was split off during the hyperparameter optimization process).
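
The ordering described here can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyAutoML, its candidates_ list, and the top-k selection are hypothetical simplifications and not auto-sklearn's implementation; only the ensemble_ guard and its error message are taken from the traceback quoted earlier in this thread.

```python
class ToyAutoML:
    """Hypothetical stand-in that mimics only the fit/fit_ensemble/refit ordering."""

    def __init__(self, ensemble_size):
        self.ensemble_size = ensemble_size
        self.candidates_ = []   # (name, validation score) pairs found by the search
        self.ensemble_ = None   # stays None until an ensemble is built

    def fit(self, candidates):
        # The search records every evaluated model; an ensemble is only
        # built here when ensemble_size is non-zero.
        self.candidates_ = list(candidates)
        if self.ensemble_size > 0:
            self.fit_ensemble(self.ensemble_size)
        return self

    def fit_ensemble(self, ensemble_size):
        # Simplification: keep the best-scoring candidates from fit().
        ranked = sorted(self.candidates_, key=lambda c: c[1], reverse=True)
        self.ensemble_ = [name for name, _ in ranked[:ensemble_size]]
        return self

    def refit(self):
        # Same guard and message as in the traceback from this issue.
        if self.ensemble_ is None:
            raise ValueError("Refit can only be called if 'ensemble_size != 0'")
        return ["%s refit on full training data" % name for name in self.ensemble_]


candidates = [("random_forest", 0.95), ("libsvm_svc", 0.97), ("adaboost", 0.90)]
automl = ToyAutoML(ensemble_size=0).fit(candidates)

refit_error = None
try:
    automl.refit()              # no ensemble yet, so this raises
except ValueError as err:
    refit_error = str(err)

automl.fit_ensemble(ensemble_size=1)   # build the ensemble after the search
refit_result = automl.refit()          # now there is something to refit

print(refit_error)
print(refit_result)
```

The point of the sketch is the state dependency: refit() operates on whatever ensemble exists, so with ensemble_size=0 an explicit fit_ensemble() call has to come between fit() and refit().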

zhuygln commented 5 years ago

@mfeurer So what is the difference between refit() and fit_ensemble()? Do you mean that refit() uses only the training portion (67% of the data under the default holdout split) and tests against the validation portion (the other 33%)? If I run the example in @smilesun's last post, which calls fit(), fit_ensemble(), and refit(), does it actually take three times the time limit (3 * 3600 s by default)?