automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License
7.63k stars 1.28k forks

Can we include/exclude data preprocessing algorithms? #1333

Closed shabir1 closed 2 years ago

shabir1 commented 2 years ago

Can we include/exclude data preprocessing algorithms?

What configuration do I have to set if I don't need any data preprocessing, or if I want to use only specific feature preprocessing and data preprocessing algorithms?

I tried:

autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=150,
    include={
        'data_preprocessor': ['NoPreprocessing']
    },
)

But I got this error:

ValueError: The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']
eddiebergman commented 2 years ago

For no data preprocessing, please see this example. If your data is dirty in any way, this will cause failures, because the pipeline relies on cleaned data, which is what the data preprocessing provides.

For no feature engineering, please see the docs here.

shabir1 commented 2 years ago

@eddiebergman I tried:

import autosklearn.classification
import autosklearn.pipeline.components.data_preprocessing
from autosklearn.pipeline.components.feature_preprocessing.no_preprocessing import NoPreprocessing

autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

clf = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={
        'data_preprocessor': ['NoPreprocessing']
    },
    # The two flags below are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    smac_scenario_args={'runcount_limit': 5},
)
clf.fit(X_train, y_train)

I got the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-bcaf679fd90c> in <module>
----> 1 automl.fit(x_train.copy(), y_train.copy())

~/.local/lib/python3.8/site-packages/autosklearn/estimators.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name)
    937         self.target_type = target_type
    938 
--> 939         super().fit(
    940             X=X,
    941             y=y,

~/.local/lib/python3.8/site-packages/autosklearn/estimators.py in fit(self, **kwargs)
    328         if self.automl_ is None:
    329             self.automl_ = self.build_automl()
--> 330         self.automl_.fit(load_models=self.load_models, **kwargs)
    331 
    332         return self

~/.local/lib/python3.8/site-packages/autosklearn/automl.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models)
   1913         load_models: bool = True,
   1914     ):
-> 1915         return super().fit(
   1916             X, y,
   1917             X_test=X_test,

~/.local/lib/python3.8/site-packages/autosklearn/automl.py in fit(self, X, y, task, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models, is_classification)
    790         # like this we can't use some of the preprocessing methods in case
    791         # the data became sparse)
--> 792         self.configuration_space, configspace_path = self._create_search_space(
    793             self._backend.temporary_directory,
    794             self._backend,

~/.local/lib/python3.8/site-packages/autosklearn/automl.py in _create_search_space(self, tmp_dir, backend, datamanager, include, exclude)
   1853         self._stopwatch.start_task(task_name)
   1854         configspace_path = os.path.join(tmp_dir, 'space.json')
-> 1855         configuration_space = pipeline.get_configuration_space(
   1856             datamanager.info,
   1857             include=include,

~/.local/lib/python3.8/site-packages/autosklearn/util/pipeline.py in get_configuration_space(info, include, exclude)
     31         return _get_regression_configuration_space(info, include, exclude)
     32     else:
---> 33         return _get_classification_configuration_space(info, include, exclude)
     34 
     35 

~/.local/lib/python3.8/site-packages/autosklearn/util/pipeline.py in _get_classification_configuration_space(info, include, exclude)
     86     }
     87 
---> 88     return SimpleClassificationPipeline(
     89         dataset_properties=dataset_properties,
     90         include=include, exclude=exclude).\

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/classification.py in __init__(self, config, steps, dataset_properties, include, exclude, random_state, init_params)
     83         if 'target_type' not in dataset_properties:
     84             dataset_properties['target_type'] = 'classification'
---> 85         super().__init__(
     86             config=config,
     87             steps=steps,

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/base.py in __init__(self, config, steps, dataset_properties, include, exclude, random_state, init_params)
     52         self._validate_include_exclude_params()
     53 
---> 54         self.config_space = self.get_hyperparameter_search_space()
     55 
     56         if config is None:

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/base.py in get_hyperparameter_search_space(self, dataset_properties)
    238         """
    239         if not hasattr(self, 'config_space') or self.config_space is None:
--> 240             self.config_space = self._get_hyperparameter_search_space(
    241                 include=self.include, exclude=self.exclude,
    242                 dataset_properties=self.dataset_properties)

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/classification.py in _get_hyperparameter_search_space(self, include, exclude, dataset_properties)
    184             dataset_properties['sparse'] = False
    185 
--> 186         cs = self._get_base_search_space(
    187             cs=cs, dataset_properties=dataset_properties,
    188             exclude=exclude, include=include, pipeline=self.steps)

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/base.py in _get_base_search_space(self, cs, dataset_properties, exclude, include, pipeline)
    350                                         include.get(node_name),
    351                                         exclude.get(node_name))
--> 352                 sub_config_space = node.get_hyperparameter_search_space(
    353                     dataset_properties, include=choices_list)
    354                 cs.add_configuration_space(node_name, sub_config_space)

~/.local/lib/python3.8/site-packages/autosklearn/pipeline/components/data_preprocessing/__init__.py in get_hyperparameter_search_space(self, dataset_properties, default, include, exclude)
    118         cs.add_hyperparameter(preprocessor)
    119         for name in available_preprocessors:
--> 120             preprocessor_configuration_space = available_preprocessors[name](
    121                 dataset_properties=dataset_properties). \
    122                 get_hyperparameter_search_space(dataset_properties)

TypeError: __init__() got an unexpected keyword argument 'dataset_properties'
shabir1 commented 2 years ago

@eddiebergman Do I have to create my own NoPreprocessing class?

eddiebergman commented 2 years ago

@shabir1 Yes. The code for it can be seen and modified in the example. We may eventually include it as a native part of the package, but because we require data preprocessing for sklearn to work, we don't provide it as a default option.
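
The traceback above ends where the data preprocessing step calls each registered component's constructor with a `dataset_properties` keyword, and the imported feature-preprocessing class does not accept that keyword. The sketch below reproduces the mismatch in plain Python. Both classes and the `instantiate` helper are hypothetical stand-ins, not auto-sklearn code, and a real component would also need the rest of the interface shown in the example.

```python
class FeatureStyleNoPreprocessing:
    """Hypothetical stand-in whose constructor does not accept dataset_properties."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class DataStyleNoPreprocessing:
    """Hypothetical stand-in whose constructor tolerates arbitrary keyword arguments."""

    def __init__(self, **kwargs):
        # Store whatever the caller passes so the constructor call cannot fail
        for key, value in kwargs.items():
            setattr(self, key, value)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X  # pass the data through unchanged


def instantiate(component_cls, dataset_properties):
    """Mimic the call in the traceback: component(dataset_properties=...)."""
    return component_cls(dataset_properties=dataset_properties)


props = {"sparse": False}

try:
    instantiate(FeatureStyleNoPreprocessing, props)
    feature_style_error = None
except TypeError as exc:
    # Same failure mode as in the traceback: unexpected keyword argument
    feature_style_error = str(exc)

component = instantiate(DataStyleNoPreprocessing, props)
rows = [[1.0, 2.0], [3.0, 4.0]]
passed_through = component.fit(rows).transform(rows)
```

Accepting `**kwargs` in the constructor is only one way to satisfy that call; an explicit `dataset_properties=None` parameter would work for this sketch as well.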

shabir1 commented 2 years ago

@eddiebergman Thank you

shabir1 commented 2 years ago

@eddiebergman What are the possible values for data_preprocessor and feature_preprocessor?

  include={
        'data_preprocessor': [?], 
        'feature_preprocessor': [?]
    }
shabir1 commented 2 years ago

@eddiebergman I found the possible values

'feature_preprocessor':  ['densifier', 'extra_trees_preproc_for_classification', 'fast_ica', 'feature_agglomeration', 'kernel_pca', 'kitchen_sinks', 'liblinear_svc_preprocessor', 'no_preprocessing', 'nystroem_sampler', 'pca', 'polynomial', 'random_trees_embedding', 'select_percentile_classification', 'select_rates_classification', 'truncatedSVD']

'data_preprocessor': ['feature_type', 'NoPreprocessing']

feature_type contains several different data preprocessors. Can we include or exclude only some of them? If yes, how?

eddiebergman commented 2 years ago

You can use the include and exclude parameters here. They are, however, mutually exclusive: you can only specify one of them.
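
One way to picture this constraint is that, for a given pipeline step, either an allow-list or a deny-list narrows the available components, but never both. The `resolve_components` helper below is a hypothetical illustration in plain Python and assumes the constraint applies per step; it is not auto-sklearn's implementation. The component names are taken from this thread.

```python
def resolve_components(available, include=None, exclude=None):
    """Return the component names left after applying include or exclude."""
    if include is not None and exclude is not None:
        raise ValueError("Specify either include or exclude, not both.")
    if include is not None:
        unknown = [name for name in include if name not in available]
        if unknown:
            raise ValueError(
                f"Unknown components {unknown}; supported components are {available}"
            )
        return list(include)
    if exclude is not None:
        return [name for name in available if name not in exclude]
    return list(available)


feature_preprocessors = ["no_preprocessing", "pca", "polynomial"]

only_pca = resolve_components(feature_preprocessors, include=["pca"])
without_pca = resolve_components(feature_preprocessors, exclude=["pca"])

try:
    resolve_components(feature_preprocessors, include=["pca"], exclude=["polynomial"])
    both_error = None
except ValueError as exc:
    both_error = str(exc)
```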

shabir1 commented 2 years ago

@eddiebergman I am talking about data_preprocessor:

include={
    'data_preprocessor': ['feature_type']
}
or
exclude={
    'data_preprocessor': ['feature_type']
}

We can only include or exclude feature_type as a whole, because there are only two possible values for 'data_preprocessor': ['feature_type', 'NoPreprocessing']. But I want to exclude hot_encoding or other individual data preprocessing steps. How can I do that?

eddiebergman commented 2 years ago

This is currently not possible. You can, however, preprocess the data however you like beforehand and then use NoPreprocessing. The reason we don't support it at the moment is that data preprocessing is applied column-wise, and our structure is not flexible enough to handle that right now.
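
Preprocessing the data yourself could look roughly like the sketch below, which applies one transformation per column in plain Python so that it has no dependencies. The helpers `impute_mean`, `ordinal_encode` and `preprocess` are hypothetical; in practice scikit-learn transformers would be a more likely choice. The cleaned rows would then be passed to `fit` with the custom NoPreprocessing component as the only 'data_preprocessor'.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]


def ordinal_encode(column):
    """Map each distinct category to a numeric code, in sorted order."""
    codes = {category: index for index, category in enumerate(sorted(set(column)))}
    return [float(codes[value]) for value in column]


def preprocess(rows, column_transforms):
    """Apply one transform per column index; other columns pass through."""
    columns = [list(column) for column in zip(*rows)]
    for index, transform in column_transforms.items():
        columns[index] = transform(columns[index])
    return [list(row) for row in zip(*columns)]


raw_rows = [
    [1.0, "red"],
    [None, "blue"],
    [3.0, "red"],
]

# Column 0 is numeric with a missing value, column 1 is categorical
clean_rows = preprocess(raw_rows, {0: impute_mean, 1: ordinal_encode})
```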

shabir1 commented 2 years ago

Okay, thank you