automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Question] How do I fix this issue? #1716

Open jordannelson0 opened 9 months ago

jordannelson0 commented 9 months ago

Here is my code:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

dataframe = read_csv("Spy.csv", skiprows=0)
dataset = dataframe.values
x = dataset[:, 0:9503]
y = dataset[:, 9503]
print(dataset)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# define search
model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
                              initial_configurations_via_metalearning=0,
                              memory_limit=2000,
                              time_left_for_this_task=10 * 60,
                              per_run_time_limit=60,
                              n_jobs=24)
# perform the search
model.fit(x_train, y_train)

# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Here is the error I'm receiving:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
    self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

During handling of the above exception, another exception occurred:

Followed by:

File "/home/jordan/Documents/Brighton_University/auto_scikit.py", line 23, in <module>
    model.fit(x_train, y_train)
AttributeError: 'NoneType' object has no attribute 'info'

I have no idea how to fix this. I have been looking for hours and trying different things, even changing datasets, and nothing has worked. Can anyone help, preferably with code snippets?

Expected behaviour

For it to run as normal

Environment and installation:

Please give details about your installation:

eddiebergman commented 9 months ago

Hi @jordannelson0,

Have you tried putting your code in an if __name__ == '__main__': block? This is required when using multiple processes on Windows, and nothing can be done about that.

jordannelson0 commented 9 months ago

> Hi @jordannelson0,
>
> Have you tried putting your code in an if __name__ == '__main__': block? This is required when using multiple processes on Windows, and nothing can be done about that.

I'm not on Windows.

eddiebergman commented 9 months ago

Oh sorry, that initial error looks very much like a Windows one, i.e. based on this:

This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:

By default, we use forkserver for spawning new processes, which is almost identical to using fork. This is the default on Linux, whereas on Windows it would have to use spawn, hence my inclination that you were using Windows (sorry for not seeing the bottom part). Does running a simple example, with all of auto-sklearn's defaults, work?
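For reference, Python's multiprocessing documentation describes three start methods, and with both spawn and forkserver the child process re-imports the main module. That is consistent with the error text above, and it means the main guard can matter on Linux too when forkserver is in use. A minimal sketch that lists the start methods available on a platform; the helper name `needs_main_guard` is made up for illustration:

```python
import multiprocessing as mp

# Start methods whose child processes re-import the main module,
# per the Python multiprocessing documentation.
REIMPORTS_MAIN = {"spawn", "forkserver"}


def needs_main_guard(start_method: str) -> bool:
    """Hypothetical helper: True if process-starting code must sit behind a main guard."""
    return start_method in REIMPORTS_MAIN


if __name__ == "__main__":
    # Print every start method this platform supports and whether it needs the guard.
    for method in mp.get_all_start_methods():
        print(method, "needs main guard:", needs_main_guard(method))
```

With fork the child inherits the parent's state instead of re-importing the script, which is why unguarded scripts often appear to work under that method.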

jordannelson0 commented 9 months ago

Auto-SKL works fine using the datasets the API has integrated. But not with this dataset.

jordannelson0 commented 9 months ago

The dataset itself, while large, is extremely clean. Using standard scikit-learn or Keras, for example, you can expect results close to 100% (accuracy metric), which is a testament to the fidelity of the dataset. So despite its size, I don't consider the dataset an issue.

jordannelson0 commented 9 months ago

Using all defaults returns the same error(s)

eddiebergman commented 9 months ago

My best advice is to subsample 100 rows or so and see if that still causes the issue. If so, subsample down to 50, and so on.

If you can construct artificial data that causes this issue then maybe I can help, but otherwise it seems like it's dataset related. There's not much I can go off of based on what's provided.
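One way to build such artificial data is sketched below. It assumes the real features are a 100 x 9503 int64 matrix of 0/1 values with a binary label, which is only a guess from the description in this thread; the function names and the halving schedule are made up for illustration.

```python
import numpy as np


def make_synthetic(n_rows: int = 100, n_features: int = 9503, seed: int = 0):
    """Random 0/1 int64 features and labels (assumed shape, not the real Spy.csv)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_rows, n_features), dtype=np.int64)
    y = rng.integers(0, 2, size=n_rows, dtype=np.int64)
    return x, y


def halving_sizes(start: int, stop: int = 5) -> list:
    """Sample sizes to try in turn: start, start // 2, ... down to at least `stop`."""
    sizes = []
    n = start
    while n >= stop:
        sizes.append(n)
        n //= 2
    return sizes


if __name__ == "__main__":
    x, y = make_synthetic()
    print(x.shape, x.dtype, y.shape, y.dtype)
    print("sizes to try:", halving_sizes(100))
```

If the failure reproduces on arrays like these, the script can be shared without the original CSV.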

This part of the traceback:

Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
    self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

Is just due to the __del__ part of autosklearn and some odd choices of the logging system. However it seems to be caused by the first error.
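The masking effect can be reproduced in isolation. The sketch below is a toy stand-in, not auto-sklearn's code: the class `ToyAutoML` and the helper `root_cause` are hypothetical and only show how an error raised while handling another error hides the first one, and how to walk back to it.

```python
class ToyAutoML:
    """Hypothetical stand-in for illustration; this is not auto-sklearn code."""

    def __init__(self):
        self._logger = None  # the logger would only be created inside fit()

    def fit(self):
        try:
            # The real failure happens before a logger exists.
            raise RuntimeError("could not start the logging process")
        except Exception as e:
            # _logger is still None, so this raises AttributeError
            # on top of the RuntimeError being handled.
            self._logger.exception(e)
            raise


def root_cause(exc: BaseException) -> BaseException:
    """Follow the implicit exception chain back to the first error."""
    while exc.__context__ is not None:
        exc = exc.__context__
    return exc


if __name__ == "__main__":
    try:
        ToyAutoML().fit()
    except AttributeError as err:
        print("surface error:", repr(err))
        print("root cause:", repr(root_cause(err)))
```

When reading a chained traceback like the one in this issue, the topmost exception is usually the one worth fixing first.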

Just to be clear, have you tried using the if __name__ == "__main__" block?

jordannelson0 commented 9 months ago

Regarding your last comment, I haven't tried. I'm sorry to admit I'm overloaded with other work at the moment (I'm doing a PhD). If you have time and are kind enough to provide me with some code samples to copy, paste, and test, I'd be more than willing.

eddiebergman commented 9 months ago

I took your sample and added the small bit to take 100 samples. If you can provide the prints, that might help.

Don't worry, I also work in a research lab and understand it can be busy. Let me know when you can try it

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

dataframe = read_csv("Spy.csv", skiprows=0)
dataset = dataframe.values

N_SAMPLES = 100
x = dataset[:N_SAMPLES, 0:9503]
y = dataset[:N_SAMPLES, 9503]
print(x, y)
print(x.dtype, y.dtype)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# define search
model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
                              initial_configurations_via_metalearning=0,
                              memory_limit=2000,
                              time_left_for_this_task=10 * 60,
                              per_run_time_limit=60,
                              n_jobs=24)
# perform the search
model.fit(x_train, y_train)

# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

jordannelson0 commented 9 months ago

Thanks, I'll try this tomorrow and get back to you. I'm in GMT timezone. For reference Thursday 4th Jan GMT.

jordannelson0 commented 9 months ago

Hi, here's the full output after trying to run with 100 samples:

[[ 6 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [224 0 0 ... 0 0 0]
 ...
 [ 1 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
[[ 6 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [224 0 0 ... 0 0 0]
 ...
 [ 1 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 634, in fit
    self._logger = self._get_logger(dataset_name)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 390, in _get_logger
    self.logging_server.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 961, in fit
    self._logger.exception(e)
AttributeError: 'NoneType' object has no attribute 'exception'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 26, in <module>
    model.fit(x_train, y_train)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
    super().fit(
  File "/home/jordan/Documents/Brighton_University/NewIdea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
    self.automl.fit(load_models=self.load_models, **kwargs)
  File "/home/jordan/Documents/Brighton_University/PhD/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
    return super().fit(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 964, in fit
    self._fit_cleanup()
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 1064, in _fit_cleanup
    self._logger.info("Closing the dask infrastructure")
AttributeError: 'NoneType' object has no attribute 'info'

jordannelson0 commented 9 months ago

I ran this with sample sizes of 100, 50, 10, and 5. Same output each time.

jordannelson0 commented 9 months ago

I also ran this with an alternate dataset that has the same data types and properties, with the label in the final column. Both datasets are used for binary classification and come from a cyber security background relating to malware on the Android platform. Each column represents a permission an app does or doesn't have access to, with 1 meaning the app has that permission and 0 meaning it doesn't. The final label column is 1 for a malicious application and 0 for a non-malicious one. I hope this provides some insight into the datasets I'm using.

eddiebergman commented 9 months ago

And this? With the main guard included?

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

if __name__ == "__main__":
    dataframe = read_csv("Spy.csv", skiprows=0)
    dataset = dataframe.values

    N_SAMPLES = 100
    x = dataset[:N_SAMPLES, 0:9503]
    y = dataset[:N_SAMPLES, 9503]
    print(x, y)
    print(x.dtype, y.dtype)

    # Split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    # define search
    model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
                                  initial_configurations_via_metalearning=0,
                                  memory_limit=2000,
                                  time_left_for_this_task=10 * 60,
                                  per_run_time_limit=60,
                                  n_jobs=24)
    # perform the search
    model.fit(x_train, y_train)

    # summarize
    print(model.sprint_statistics())
    # evaluate best model
    y_hat = model.predict(x_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)

jordannelson0 commented 9 months ago

[[ 6 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [224 0 0 ... 0 0 0]
 ...
 [ 1 0 0 ... 0 0 0]
 [304 0 0 ... 0 0 0]
 [ 3 0 0 ... 0 0 0]] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
int64 int64
[ERROR] [2024-01-04 15:29:19,768:Client-AutoML(1):016542bd-ab16-11ee-a0b9-0ddc104e5793] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
[ERROR] [2024-01-04 15:29:19,768:Client-AutoML(1):016542bd-ab16-11ee-a0b9-0ddc104e5793] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 765, in fit
    self._do_dummy_prediction()
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
    raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 27, in <module>
    model.fit(x_train, y_train)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
    super().fit(
  File "/home/jordan/Documents/Brighton_University/NewIdea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
    self.automl.fit(load_models=self.load_models, **kwargs)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
    return super().fit(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 962, in fit
    raise e
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 765, in fit
    self._do_dummy_prediction()
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
    raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 2000 MB).', 'configuration_origin': 'DUMMY'}.",)

Process finished with exit code 1

eddiebergman commented 9 months ago

Okay, so that's a lot more helpful of an error. My guess is that since you have 9000+ features and they are all integers, autosklearn is trying to one-hot encode them. This effectively adds X new columns per column, where X is the number of unique integer values in that column. Multiply that by ~9000 and it's likely the dataset size explodes.
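The arithmetic behind that guess can be sketched as follows. The numbers in the example are hypothetical and not measured from Spy.csv, the function names are made up, and the memory figure assumes the encoded matrix is stored densely with 8-byte values, which may not match what auto-sklearn does internally.

```python
from typing import Iterable


def one_hot_width(unique_counts: Iterable[int]) -> int:
    """Total columns after one-hot encoding: one column per unique value per feature."""
    return sum(unique_counts)


def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Size in bytes of a dense n_rows x n_cols array (8-byte values by default)."""
    return n_rows * n_cols * bytes_per_value


if __name__ == "__main__":
    # Hypothetical illustration: 1,000 rows, 9,503 features,
    # and an assumed 50 distinct values in every column.
    n_rows = 1_000
    unique_counts = [50] * 9503
    width = one_hot_width(unique_counts)
    size_mb = dense_bytes(n_rows, width) / 1e6
    print(width, "columns,", size_mb, "MB if stored densely")
```

Plugging in the real per-column unique counts (for example from `dataframe.nunique()`) would show whether the encoded data can plausibly exceed the configured memory_limit.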

Estimators like a hist gradient boosting classifier do not really care about one-hot encoded variables, while something like an MLP will. The only thing I can suggest is to try disabling "data_preprocess" with the exclude parameter, since your data is already pretty clean. If you need to do some data preprocessing, I would suggest doing it manually before AutoSklearn.

https://github.com/automl/auto-sklearn/blob/673211252ca508b6f5bb92cf5fa87c6455bbad99/autosklearn/estimators.py#L180-L190

Another alternative may be to convert the data to float dtypes, as autosklearn then won't try to one-hot encode them, but I do not know your data or whether these values represent categoricals.
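The cast itself can be sketched like this. It only shows the NumPy conversion on a small made-up array; whether auto-sklearn then treats the columns as numerical is the suggestion above and is not verified here.

```python
import numpy as np

# Small made-up integer matrix standing in for the real feature array.
x = np.array([[6, 0, 1], [304, 1, 0]], dtype=np.int64)

# Cast the features to float; float32 also uses half the memory of int64.
x_float = x.astype(np.float32)

print(x.dtype, "->", x_float.dtype)
print(x.nbytes, "bytes ->", x_float.nbytes, "bytes")
```

In the script from this thread, the equivalent step would be applied to `x` before calling `train_test_split`.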

jordannelson0 commented 9 months ago

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

if __name__ == "__main__":
    dataframe = read_csv("Spy.csv", skiprows=0)
    dataset = dataframe.values

    N_SAMPLES = 100
    x = dataset[:N_SAMPLES, 0:9503]
    y = dataset[:N_SAMPLES, 9503]
    print(x, y)
    print(x.dtype, y.dtype)

    # Split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    # define search
    model = AutoSklearnClassifier(ensemble_kwargs={'ensemble_size': 1},
                                  initial_configurations_via_metalearning=0,
                                  memory_limit=2000,
                                  time_left_for_this_task=10 * 60,
                                  per_run_time_limit=60,
                                  n_jobs=24,
                                  exclude={
                                      'data_preprocessor': ['feature_type']
                                  })
    # perform the search
    model.fit(x_train, y_train)

    # summarize
    print(model.sprint_statistics())
    # evaluate best model
    y_hat = model.predict(x_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)

And got:

[ERROR] [2024-01-04 16:52:46,496:Client-AutoML(1):aef2d3dd-ab21-11ee-adc1-0ddc104e5793] No valid pipeline found.
Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 751, in fit
    self.configuration_space, configspace_path = self._create_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2252, in _create_search_space
    configuration_space = pipeline.get_configuration_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 53, in get_configuration_space
    return _get_classification_configuration_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 155, in _get_classification_configuration_space
    return SimpleClassificationPipeline(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 88, in __init__
    super().__init__(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 66, in __init__
    self.config_space = self.get_hyperparameter_search_space(feat_type=feat_type)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 276, in get_hyperparameter_search_space
    self.config_space = self._get_hyperparameter_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 206, in _get_hyperparameter_search_space
    cs = self._get_base_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 384, in _get_base_search_space
    assert np.sum(matches) != 0, "No valid pipeline found."
AssertionError: No valid pipeline found.
Traceback (most recent call last):
  File "/home/jordan/Documents/Brighton_University/New_Idea/auto_scikit.py", line 30, in <module>
    model.fit(x_train, y_train)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 1448, in fit
    super().fit(
  File "/home/jordan/Documents/Brighton_University/NewIdea/venv/lib/python3.10/site-packages/autosklearn/estimators.py", line 540, in fit
    self.automl.fit(load_models=self.load_models, **kwargs)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2304, in fit
    return super().fit(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 962, in fit
    raise e
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 751, in fit
    self.configuration_space, configspace_path = self._create_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/automl.py", line 2252, in _create_search_space
    configuration_space = pipeline.get_configuration_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 53, in get_configuration_space
    return _get_classification_configuration_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/util/pipeline.py", line 155, in _get_classification_configuration_space
    return SimpleClassificationPipeline(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 88, in __init__
    super().__init__(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 66, in __init__
    self.config_space = self.get_hyperparameter_search_space(feat_type=feat_type)
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 276, in get_hyperparameter_search_space
    self.config_space = self._get_hyperparameter_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/classification.py", line 206, in _get_hyperparameter_search_space
    cs = self._get_base_search_space(
  File "/home/jordan/Documents/Brighton_University/New_Idea/venv/lib/python3.10/site-packages/autosklearn/pipeline/base.py", line 384, in _get_base_search_space
    assert np.sum(matches) != 0, "No valid pipeline found."
AssertionError: No valid pipeline found.

eddiebergman commented 9 months ago

Hmmm sorry, I wish that had worked. You'll likely have to try this example then: https://automl.github.io/auto-sklearn/master/examples/80_extending/example_extending_data_preprocessor.html#sphx-glr-examples-80-extending-example-extending-data-preprocessor-py

jordannelson0 commented 8 months ago

I tried that; unfortunately, the memory issue persisted.

jordannelson0 commented 8 months ago

from typing import Optional
from pprint import pprint

import autosklearn.classification
import autosklearn.pipeline.components.data_preprocessing
import sklearn.metrics
from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT
from pandas import read_csv
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration as there is None


# Add NoPreprocessing component to auto-sklearn.
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

dataframe = read_csv("adware1.csv", skiprows=0)
dataset = dataframe.values

N_SAMPLES = 100
x = dataset[:, 0:440]
y = dataset[:, 440]
print(x, y)
print(x.dtype, y.dtype)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = autosklearn.classification.AutoSklearnClassifier(
    ensemble_kwargs={'ensemble_size': 0},
    time_left_for_this_task=30*60,
    include={"data_preprocessor": ["NoPreprocessing"]},
    # Below two flags are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    per_run_time_limit=60,
)
clf.fit(x_train, y_train)

# To check that models were found without issue when running examples
assert len(clf.get_models_with_weights()) > 0
print(clf.sprint_statistics())

# summarize
print(clf.sprint_statistics())

# evaluate best model
y_hat = clf.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)
pprint(clf.show_models())

I do have this example working with a different, smaller dataset.