automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

When adding a NoPreprocessing component to auto-sklearn, a custom lasso regression runs successfully, while a custom abess regression crashes #1661

Open belzheng opened 1 year ago

belzheng commented 1 year ago

Describe the bug

When adding the NoPreprocessing component to auto-sklearn, my custom lasso regression runs successfully, while my custom abess regression crashes. I wrote both the lasso regression and the abess regression components myself, and both run successfully when NoPreprocessing is not added. I wonder why the abess regression crashes once NoPreprocessing is added. The following are my snippets for reference:

[three screenshots of the code snippets]


aron-bram commented 1 year ago

Hi, could you please also post your code for the custom preprocessor that you passed in? Thanks in advance.

belzheng commented 1 year ago

Ok, here is my code for debugging:

from typing import Optional

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformIntegerHyperparameter

import autosklearn.pipeline.components.data_preprocessing
import autosklearn.pipeline.components.feature_preprocessing
import autosklearn.pipeline.components.regression
import autosklearn.regression
from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import (
    AutoSklearnPreprocessingAlgorithm,
    AutoSklearnRegressionAlgorithm,
)
from autosklearn.pipeline.constants import (
    DENSE,
    INPUT,
    PREDICTIONS,
    SIGNED_DATA,
    SPARSE,
    UNSIGNED_DATA,
)


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data."""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        # No hyperparameters to tune, so return an empty configuration space
        return ConfigurationSpace()

# Add NoPreprocessing component to auto-sklearn.
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)
cs = NoPreprocessing.get_hyperparameter_search_space()
print(cs)

class kBinsDiscretizer(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, n_bins, random_state=None):
        self.n_bins = n_bins
        self.random_state = random_state
        self.preprocessor = None

    def fit(self, X, y=None):
        # n_bins arrives as a sampled hyperparameter value; cast it to int
        self.n_bins = int(self.n_bins)

        from sklearn.preprocessing import KBinsDiscretizer

        self.preprocessor = KBinsDiscretizer(n_bins=self.n_bins)
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        if self.preprocessor is None:
            raise NotImplementedError()
        return self.preprocessor.transform(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "kBinsDiscretizer",
            "name": "kBinsDiscretizer",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (DENSE, UNSIGNED_DATA, SIGNED_DATA),
            "output": (DENSE, UNSIGNED_DATA, SIGNED_DATA),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        cs = ConfigurationSpace()
        n_bins = UniformIntegerHyperparameter(
            name="n_bins", lower=2, upper=10, default_value=5
        )
        cs.add_hyperparameters([n_bins])
        return cs

# Add kBinsDiscretizer component to auto-sklearn.
autosklearn.pipeline.components.feature_preprocessing.add_preprocessor(kBinsDiscretizer)
cs = kBinsDiscretizer.get_hyperparameter_search_space()
print(cs)

class AbessRegression(AutoSklearnRegressionAlgorithm):
    def __init__(self, random_state=None):
        # self.exchange_num = exchange_num
        self.random_state = random_state
        self.estimator = None

    def fit(self, X, y):
        from abess import LinearRegression
        self.estimator = LinearRegression()
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        if self.estimator is None:
            raise NotImplementedError
        return self.estimator.predict(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'abess',
            'name': 'abess linear regression',
            'handles_regression': True,
            'handles_classification': False,
            'handles_multiclass': False,
            'handles_multilabel': False,
            'handles_multioutput': True,
            'is_deterministic': True,
            'input': (SPARSE, DENSE, UNSIGNED_DATA, SIGNED_DATA),
            'output': (PREDICTIONS,)
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        cs = ConfigurationSpace()
        # exchange_num = UniformIntegerHyperparameter(
        #     name='exchange_num', lower=4, upper=5, default_value=5
        # )
        # cs.add_hyperparameters([exchange_num])
        return cs

# Add abess component to auto-sklearn.
autosklearn.pipeline.components.regression.add_regressor(AbessRegression)
cs = AbessRegression.get_hyperparameter_search_space()
print(cs)
regaallp = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=60,
    per_run_time_limit=10,
    include={
        "data_preprocessor": ["NoPreprocessing"],
        "regressor": ["AbessRegression"],
        "feature_preprocessor": [
            "no_preprocessing",
            "polynomial",
            "kBinsDiscretizer",
        ],
    },
    memory_limit=6144,
)
# X and y were loaded earlier (not shown in this snippet)
regaallp.fit(X, y)
# yaallp_pred = regaallp.predict(X_test.values)

The error:

TypeError                                 Traceback (most recent call last)
Cell In [10], line 15
      1 regaallp = autosklearn.regression.AutoSklearnRegressor(
      2     time_left_for_this_task=60,
      3     per_run_time_limit=10,
    (...)
     13     memory_limit=6144,
     14 )
---> 15 regaallp.fit(X, y)

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/estimators.py:1587, in AutoSklearnRegressor.fit(self, X, y, X_test, y_test, feat_type, dataset_name)
   1585 # Fit is supposed to be idempotent!
   1586 # But not if we use share_mode.
-> 1587 super().fit(
   1588     X=X,
   1589     y=y,
   1590     X_test=X_test,
   1591     y_test=y_test,
   1592     feat_type=feat_type,
   1593     dataset_name=dataset_name,
   1594 )

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/estimators.py:540, in AutoSklearnEstimator.fit(self, **kwargs)
    538 if self.automl_ is None:
    539     self.automl_ = self.build_automl()
--> 540 self.automl_.fit(load_models=self.load_models, **kwargs)

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/automl.py:2394, in AutoMLRegressor.fit(self, X, y, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models)
-> 2394 return super().fit(
   2395     X,
   2396     y,
   2397     X_test=X_test,
   2398     y_test=y_test,
   2399     feat_type=feat_type,
   2400     dataset_name=dataset_name,
   2401     only_return_configuration_space=only_return_configuration_space,
   2402     load_models=load_models,
   2403     is_classification=False,
   2404 )

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/automl.py:962, in AutoML.fit(self, X, y, task, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models, is_classification)
    959 except Exception as e:
    960     # This will be called before the _fit_cleanup
    961     self._logger.exception(e)
--> 962     raise e

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/automl.py:899, in AutoML.fit(self, X, y, task, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models, is_classification)
    895 (
    896     self.runhistory_,
    897     self.trajectory_,
    898     self._budget_type,
--> 899 ) = _proc_smac.run_smbo()

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/autosklearn/smbo.py:552, in AutoMLSMBO.run_smbo(self)
    549 if self.trials_callback is not None:
    550     smac.register_callback(self.trials_callback)
--> 552 smac.optimize()

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/smac/facade/smac_ac_facade.py:720, in SMAC4AC.optimize(self)
--> 720 incumbent = self.solver.run()

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/smac/optimizer/smbo.py:273, in SMBO.run(self)
--> 273 self.runhistory.add(
    274     config=run_info.config,
    275     cost=float(MAXINT)
    276     if num_obj == 1
    277     else np.full(num_obj, float(MAXINT)),
    278     time=0.0,
    279     status=StatusType.RUNNING,
    280     instance_id=run_info.instance,
    281     seed=run_info.seed,
    282     budget=run_info.budget,
    283 )

File ~/miniconda3/envs/p38/lib/python3.8/site-packages/smac/runhistory/runhistory.py:257, in RunHistory.add(self, config, cost, time, status, instance_id, seed, budget, starttime, endtime, additional_info, origin, force_update)
    256 if config is None:
--> 257     raise TypeError("Configuration to add to the runhistory must not be None")

TypeError: Configuration to add to the runhistory must not be None

I would be very grateful for any help, as this problem has been troubling me for a long time.

aron-bram commented 1 year ago

Thanks for the extra info.

I ran your code on the diabetes dataset from sklearn (not sure what you used), and your custom feature and data preprocessors worked as expected. When I included abess's linear regressor, I managed to reproduce your error.
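
For reference, here is roughly how I set up the data; this is my assumption for the X and y that your snippet does not define, so substitute your own dataset:

# Assumed data setup: any small regression dataset should do.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)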

You said that the abess regressor runs without problems if you only include your regressor in the search. However, when I tried including only abess's linear regressor, the optimization still failed. In my case, the issue is with abess itself. Namely, I'm thinking that this open issue caused the sampled pipelines to crash. Unfortunately, this stops me from helping you debug further unless abess resolves it.
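
One way to check this independently of auto-sklearn is to fit abess directly on the same data. This is only a sketch, reusing the X and y loaded above; if it crashes on its own, your custom components are not at fault:

# Standalone abess sanity check, outside of any auto-sklearn pipeline.
from abess import LinearRegression

est = LinearRegression()
est.fit(X, y)
print(est.predict(X[:5]))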

What you could try is to take a look at my answer to another issue you raised, #1660, where I also explain how to find the runhistory file. Look at the sampled configurations there and check whether any errors are attached to the runs; I would expect to find errors raised by abess. If you do find such errors, then this issue should be closed, since it is not related to auto-sklearn.
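
As a sketch of what that inspection could look like (the output path and the exact JSON layout depend on your auto-sklearn and SMAC versions, so treat both as assumptions to adapt):

import json

# Adjust the path to wherever your run wrote its SMAC output, e.g.
# <tmp_folder>/smac3-output/run_<seed>/runhistory.json.
with open("tmp/smac3-output/run_1/runhistory.json") as f:
    runhistory = json.load(f)

# Each entry pairs a run key with run data; crashed runs usually carry
# an error message and/or a traceback in the trailing additional-info field.
for run_key, run_value in runhistory["data"]:
    cost, duration, status, start, end, additional_info = run_value
    if additional_info:
        print(status, additional_info.get("error"), sep="\n")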

Hope this will help you solve it.