automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

How to use BaseCrossValidator object #599

Closed irivo closed 5 years ago

irivo commented 5 years ago

Hello. I'm trying to modify the cross-validation example https://automl.github.io/auto-sklearn/master/examples/example_crossvalidation.html#sphx-glr-examples-example-crossvalidation-py to use a BaseCrossValidator object as the resampling_strategy argument (for example, LeaveOneOut), but I just can't figure out how to do it.

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import LeaveOneOut
import autosklearn.classification
tmp_folder = '/mnt/e/autosklearn_parallel_example_tmp'
output_folder = '/mnt/e/autosklearn_parallel_example_out'

def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder=tmp_folder,
        output_folder=output_folder,
        delete_tmp_folder_after_terminate=False,
        #first try
#        resampling_strategy='LeaveOneOut',
#        resampling_strategy_arguments={},
        #second try
#        resampling_strategy='TrainEvaluator',
#        resampling_strategy_arguments={'LeaveOneOut': {}},
        #third try
#        resampling_strategy='BaseCrossValidator',
#        resampling_strategy_arguments={'LeaveOneOut': {}},
        #fourth try
        resampling_strategy=LeaveOneOut(),
        resampling_strategy_arguments={},
    )

    # fit() changes the data in place, but refit needs the original data. We
    # therefore copy the data. In practice, one should reload the data
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='breast_cancer')
    # During fit(), models are fit on individual cross-validation folds. To use
    # all available data, we call refit() which trains all models in the
    # final ensemble on the whole dataset.
    automl.refit(X_train.copy(), y_train.copy())

    print(automl.show_models())

    predictions = automl.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

if __name__ == '__main__':
    main()

On each of the first three tries I get this error:

Traceback (most recent call last):
  File "test_420_ASKL.py", line 52, in <module>
    main()
  File "test_420_ASKL.py", line 39, in main
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='breast_cancer')
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/estimators.py", line 500, in fit
    dataset_name=dataset_name,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/estimators.py", line 267, in fit
    self._automl.fit(*args, **kwargs)
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 965, in fit
    only_return_configuration_space=only_return_configuration_space,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 203, in fit
    only_return_configuration_space,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 322, in _fit
    and not issubclass(self._resampling_strategy, BaseCrossValidator)\
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/abc.py", line 228, in __subclasscheck__
    if issubclass(subclass, scls):
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/abc.py", line 232, in __subclasscheck__
    cls._abc_negative_cache.add(subclass)
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/_weakrefset.py", line 84, in add
    self.data.add(ref(item, self._remove))
TypeError: cannot create weak reference to 'str' object

On the fourth try I get a different error:

Traceback (most recent call last):
  File "test_420_ASKL.py", line 55, in <module>
    main()
  File "test_420_ASKL.py", line 42, in main
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='breast_cancer')
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/estimators.py", line 500, in fit
    dataset_name=dataset_name,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/estimators.py", line 267, in fit
    self._automl.fit(*args, **kwargs)
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 965, in fit
    only_return_configuration_space=only_return_configuration_space,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 203, in fit
    only_return_configuration_space,
  File "/tmp/yes/envs/AutoSLK_42/lib/python3.6/site-packages/autosklearn/automl.py", line 326, in _fit
    self._resampling_strategy)
ValueError: Illegal resampling strategy: LeaveOneOut()

How do I do this correctly?
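
A note on the tracebacks above: both pass through the `issubclass(self._resampling_strategy, BaseCrossValidator)` check shown in the first one, and `issubclass()` expects a class, not a string or an instance. The sketch below reproduces that behavior with a stand-in abstract base class. `Splitter`, `LeaveOne`, and `describe` are hypothetical names that are not part of auto-sklearn or scikit-learn, and the behavior described is that of Python 3.7 and newer (Python 3.6, used in the tracebacks above, reacts slightly differently).

```python
from abc import ABC, abstractmethod


class Splitter(ABC):
    """Hypothetical stand-in for an abstract cross-validator base class."""

    @abstractmethod
    def split(self, n_samples):
        ...


class LeaveOne(Splitter):
    """Hypothetical concrete splitter; not scikit-learn's LeaveOneOut."""

    def split(self, n_samples):
        return [(i,) for i in range(n_samples)]


def describe(strategy):
    """Report how an issubclass() check reacts to the given argument."""
    try:
        return "accepted" if issubclass(strategy, Splitter) else "rejected"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(describe(LeaveOne))    # the class itself passes the check
print(describe(LeaveOne()))  # an instance is not a class, so the check raises
print(describe("LeaveOne"))  # a string is not a class either
```

Under these assumptions, only the uninstantiated class gets through such a check, which would be consistent with passing `LeaveOneOut` rather than `LeaveOneOut()` or `'LeaveOneOut'`.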

khenrix commented 5 years ago

Hey, I had the same problem. I solved it by feeding all the arguments for the Cross-validator object.

Example:

import autosklearn.classification
from sklearn.model_selection import KFold

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_cv_example_tmp',
    output_folder='/tmp/autosklearn_cv_example_out',
    delete_tmp_folder_after_terminate=False,
    # Pass the class itself, not an instance or a string
    resampling_strategy=KFold,
    resampling_strategy_arguments={'n_splits': 5, 'shuffle': False, 'random_state': None},
)

As stated in the API Docs.

BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments required by chosen class as specified in scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used. If no defaults are available, an exception is raised. Refer to the ‘n_splits’ argument as ‘folds’.
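
One way to read the documentation excerpt above: the class is passed uninstantiated, and the arguments dictionary is later unpacked into its constructor, with anything omitted falling back to the class defaults. A minimal sketch of that pattern follows. `FoldSplitter` and `build_splitter` are hypothetical stand-ins, and this is not auto-sklearn's actual implementation.

```python
class FoldSplitter:
    """Hypothetical stand-in for a KFold-like splitter class."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


def build_splitter(strategy_class, strategy_arguments=None):
    """Instantiate the splitter class from a dictionary of keyword arguments."""
    if not isinstance(strategy_class, type):
        raise TypeError("expected a class, got %r" % (strategy_class,))
    # Arguments left out of the dictionary fall back to the constructor defaults.
    return strategy_class(**(strategy_arguments or {}))


splitter = build_splitter(FoldSplitter, {"n_splits": 3})
print(splitter.n_splits, splitter.shuffle, splitter.random_state)
```

If a class had a required constructor argument with no default, unpacking an empty dictionary would raise an exception, which matches the "if no defaults are available, an exception is raised" wording in the excerpt.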

irivo commented 5 years ago

Thanks! I finally got it working. I had read the documentation, but for some reason I misunderstood this part.

alchav06 commented 4 years ago

Hi, I'm running some tests using a BaseCrossValidator with LeaveOneOut, but I'm still getting errors. I have already tried the solutions suggested above.

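import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.model_selection import LeaveOneOut
import autosklearn.regression

def main():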
    X, y = sklearn.datasets.load_boston(return_X_y=True)
    feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.regression.AutoSklearnRegressor( 
        time_left_for_this_task=120, 
        per_run_time_limit=30, 
        tmp_folder='/tmp/autosklearn_kf_tmp', 
        output_folder='/tmp/autosklearn_kf_out', 
        delete_tmp_folder_after_terminate=False, 
        resampling_strategy=LeaveOneOut, 
        resampling_strategy_arguments={}, 
    )
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='boston',
               feat_type=feature_types)
    automl.refit(X_train.copy(), y_train.copy())
    print(automl.show_models())
    predictions = automl.predict(X_test)
    print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))

if __name__ == '__main__':
    main()

The error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c4b87957e7d1> in <module>
     36 
     37 if __name__ == '__main__':
---> 38     main()

<ipython-input-2-c4b87957e7d1> in main()
     23     # fit() changes the data in place, but refit needs the original data. We
     24     # therefore copy the data. In practice, one should reload the data
---> 25     automl.fit(X_train.copy(), y_train.copy(), dataset_name='breast_cancer')
     26     # During fit(), models are fit on individual cross-validation folds. To use
     27     # all available data, we call refit() which trains all models in the

~/anaconda3/envs/autosk/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, X, y, X_test, y_test, metric, feat_type, dataset_name)
    662             metric=metric,
    663             feat_type=feat_type,
--> 664             dataset_name=dataset_name,
    665         )
    666 

~/anaconda3/envs/autosk/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, **kwargs)
    335             )
    336             self._automl.append(automl)
--> 337             self._automl[0].fit(**kwargs)
    338         else:
    339             tmp_folder, output_folder = get_randomized_directory_names(

~/anaconda3/envs/autosk/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, X_test, y_test, metric, feat_type, dataset_name, only_return_configuration_space, load_models)
    994             dataset_name=dataset_name,
    995             only_return_configuration_space=only_return_configuration_space,
--> 996             load_models=load_models,
    997         )
    998 

~/anaconda3/envs/autosk/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, task, metric, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models)
    206             metric=metric,
    207             load_models=load_models,
--> 208             only_return_configuration_space=only_return_configuration_space,
    209         )
    210 

~/anaconda3/envs/autosk/lib/python3.7/site-packages/autosklearn/automl.py in _fit(self, datamanager, metric, load_models, only_return_configuration_space)
    341              'cv', 'partial-cv',
    342              'partial-cv-iterative-fit'] \
--> 343              and not issubclass(self._resampling_strategy, BaseCrossValidator)\
    344              and not issubclass(self._resampling_strategy, _RepeatedSplits)\
    345              and not issubclass(self._resampling_strategy, BaseShuffleSplit):

~/anaconda3/envs/autosk/lib/python3.7/abc.py in __subclasscheck__(cls, subclass)
    141         def __subclasscheck__(cls, subclass):
    142             """Override for issubclass(subclass, cls)."""
--> 143             return _abc_subclasscheck(cls, subclass)
    144 
    145         def _dump_registry(cls, file=None):

TypeError: issubclass() arg 1 must be a class

khenrix commented 4 years ago

Maybe someone can look into it a bit further and find the actual cause. You could try this and work from there. It gave me a terrible score, but it ran 🤷‍♂️ Best of luck!

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import LeaveOneOut as LOO
import autosklearn.regression

class LeaveOneOut(LOO):

    def __init__(self, X):
        self.X = X

    def get_n_splits(self, X=None, y=None, groups=None):
        return super().get_n_splits(self.X)

def main():
    X, y = sklearn.datasets.load_boston(return_X_y=True)
    feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder='/tmp/autosklearn_regression_example_tmp',
        output_folder='/tmp/autosklearn_regression_example_out',
        resampling_strategy=LeaveOneOut, 
        resampling_strategy_arguments={'X': X_train}
    )
    automl.fit(X_train.copy(), y_train.copy(), dataset_name='boston',
               feat_type=feature_types)
    automl.refit(X_train.copy(), y_train.copy())

    print(automl.show_models())
    predictions = automl.predict(X_test)
    print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))

if __name__ == '__main__':
    main()
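
A possible reason the workaround above stores `X`: with leave-one-out, the number of splits equals the number of samples, so it cannot be known without the data. The following pure-Python sketch of leave-one-out splitting illustrates this. The helper name `leave_one_out_splits` is hypothetical, and this is not scikit-learn's implementation.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_indices) pairs, one pair per sample."""
    if n_samples < 2:
        raise ValueError("leave-one-out needs at least two samples")
    for held_out in range(n_samples):
        # Every sample except the held-out one goes into the training set.
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


splits = list(leave_one_out_splits(4))
for train, test in splits:
    print(train, test)
print("number of splits:", len(splits))
```

Because every evaluated configuration is fit once per sample, leave-one-out can be expensive under a short time budget.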

mfeurer commented 4 years ago

@alchav06 your issue appears to be different from the one this thread is about. Could you please open a new issue so we can properly keep track of this problem?