automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

convert to scikit learn code. #388

Open palapalamao opened 6 years ago

palapalamao commented 6 years ago

[(0.666667, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'no_preprocessing', 'regressor:choice': 'adaboost', 'rescaling:choice': 'minmax', 'one_hot_encoding:minimum_fraction': 0.010000000000000004, 'regressor:adaboost:learning_rate': 0.9890631979261445, 'regressor:adaboost:loss': 'linear', 'regressor:adaboost:max_depth': 10, 'regressor:adaboost:n_estimators': 127}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
(0.333333, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'random_trees_embedding', 'regressor:choice': 'liblinear_svr', 'rescaling:choice': 'standardize', 'one_hot_encoding:minimum_fraction': 0.00011808426850838513, 'preprocessor:random_trees_embedding:max_depth': 3, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 3, 'preprocessor:random_trees_embedding:min_samples_split': 3, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 68, 'regressor:liblinear_svr:C': 1.4174149191248073, 'regressor:liblinear_svr:dual': 'False', 'regressor:liblinear_svr:epsilon': 0.0328370684051209, 'regressor:liblinear_svr:fit_intercept': 'True', 'regressor:liblinear_svr:intercept_scaling': 1, 'regressor:liblinear_svr:loss': 'squared_epsilon_insensitive', 'regressor:liblinear_svr:tol': 0.0012221149693867595}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
]

R2 score: 0.87227602958

How do I convert the model I ran to scikit-learn code? Could you give me some example code?

mfeurer commented 6 years ago

Have a look at https://github.com/automl/auto-sklearn/issues/30. It would actually be great if the returned ensemble would be a pure scikit-learn model. Not sure how to achieve this, though.

activaigor commented 5 years ago

@mfeurer is this still relevant? Can I get more information about what is required here?

mfeurer commented 5 years ago

I think it would still be great to have this feature. Basically, the final model/ensemble needs to be converted to pure scikit-learn code. Similar to show_models(), this would print a representation of the models found by Auto-sklearn, but one that could be pasted into Python to instantiate standalone scikit-learn code.
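A minimal sketch of what "a representation that could be pasted into Python" might look like. The helper names `render_call` and `render_pipeline` are hypothetical, and the mapping from Auto-sklearn components to the scikit-learn classes listed in `steps` is an assumption made for illustration, not something Auto-sklearn provides:

```python
def render_call(class_path, params):
    """Return Python source for the call `class_path(**params)`."""
    args = ", ".join(
        f"{name}={value!r}" for name, value in sorted(params.items())
    )
    return f"{class_path}({args})"


def render_pipeline(steps):
    """Return source for a sklearn.pipeline.Pipeline.

    `steps` is a list of (step_name, class_path, params) tuples.
    """
    lines = ["sklearn.pipeline.Pipeline(steps=["]
    for name, class_path, params in steps:
        lines.append(f"    ({name!r}, {render_call(class_path, params)}),")
    lines.append("])")
    return "\n".join(lines)


# Illustrative steps only: which scikit-learn class each Auto-sklearn
# component corresponds to is assumed here, not read from the library.
steps = [
    ("imputation", "sklearn.impute.SimpleImputer", {"strategy": "mean"}),
    ("rescaling", "sklearn.preprocessing.MinMaxScaler", {}),
]
source = render_pipeline(steps)
print(source)
```

The printed text only becomes runnable once the matching imports are added, which is one of the open points for a real export.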

How familiar are you with Auto-sklearn?

GeorgePearse commented 2 years ago

@mfeurer I'd love to work on this if it's still considered beneficial.

eddiebergman commented 2 years ago

Hi @GeorgePearse,

Having something like TPOT's export is something I think @mfeurer and I discussed a while ago, but feel free to correct me if I'm wrong there @mfeurer.

From what I remember, there used to be more activity around this feature request, and I'm sure people would find it useful when they want to strip auto-sklearn away from the final model used in production, or simply to play around.

We have a PR (#1321) by another user at the moment that gives access to the underlying models by updating the show_models() function to return a dict mapping from keys to the different pipeline steps and the models. This could be used as a basis for accessing all the different components of the final optimization result, where the models would have to be extracted from our wrapper components and then have their hyperparameters filled in correctly.

I'm not entirely sure what the best process is for converting this into a pure Python scikit-learn script, but having access to the ConfigSpace Config that generated the models would be hugely beneficial, as that is how we instantiate them. These configs can also be printed out as a dict, which could make setting up model creation quite easy. I figure this is the most difficult step and the one people would most want automated: instantiating the models with the hyperparameters we found to be best.
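As a rough sketch of how such a dict could be prepared for model creation, the flat `step:component:parameter` keys can be grouped per pipeline step. The helpers `parse_value` and `group_config` are hypothetical, and the example config is a subset of the first pipeline printed at the top of this thread; this is not how Auto-sklearn itself processes configurations:

```python
import ast


def parse_value(value):
    """Turn string-encoded values such as 'True' or 'None' into Python objects."""
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # an ordinary string such as 'mean'
    return value


def group_config(config):
    """Group flat keys into {step: {'choice': ..., 'params': {...}}}."""
    grouped = {}
    for key, value in config.items():
        step, *rest = key.split(":")
        entry = grouped.setdefault(step, {"choice": None, "params": {}})
        if rest == ["choice"]:
            entry["choice"] = value
        elif rest:
            # Keep only the parameter name; the component name (if any)
            # is already recorded under 'choice'.
            entry["params"][rest[-1]] = parse_value(value)
    return grouped


config = {
    "imputation:strategy": "mean",
    "one_hot_encoding:use_minimum_fraction": "True",
    "preprocessor:choice": "no_preprocessing",
    "regressor:choice": "adaboost",
    "rescaling:choice": "minmax",
    "regressor:adaboost:learning_rate": 0.9890631979261445,
    "regressor:adaboost:loss": "linear",
    "regressor:adaboost:max_depth": 10,
    "regressor:adaboost:n_estimators": 127,
}
grouped = group_config(config)
print(grouped["regressor"])
```

Mapping each grouped entry to an actual scikit-learn class and its constructor arguments would still have to be written per component.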

I will get back to you if I can think of any other helpful pointers, but I would be happy to help out and discuss getting this feature in :)

GeorgePearse commented 2 years ago

Cheers @eddiebergman it's an interesting problem, just looking through TPOT's implementation now and will start digging into the internals of this repo in a second.

GeorgePearse commented 2 years ago

As a warning I'm unlikely to give this a real crack until the 26th Dec 2021 and beyond. If anyone can give it a go before then by all means go for it.

eddiebergman commented 2 years ago

No problem, you probably won't get much feedback until mid January in that case, feel free to work on it before then if you'd like but there's no rush, thanks for the contribution offer :)

mfeurer commented 2 years ago

I once had a look into this feature, but coded it as a standalone function. Instead it should be built into Auto-sklearn, for example, one should be able to do classifier.export_to_sklearn() and get a scikit-learn-only model. Nevertheless, here's the code for reference:

import os
import pickle
import types

import numpy as np
import sklearn.compose
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

import autosklearn.estimators
import autosklearn.pipeline.base
import autosklearn.pipeline.components.base
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing
import autosklearn.pipeline.components.data_preprocessing.balancing.balancing
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_numerical
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_categorical

bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch['target'].to_numpy()
X = bunch['data'].to_numpy(dtype=float)

X_train, X_test, y_train, y_test = \
     sklearn.model_selection.train_test_split(X, y, random_state=1)
feat_type = ['Categorical' if x.name == 'category' else 'Numerical' for x in bunch['data'].dtypes]

pickle_name = 'model.pkl'
if not os.path.exists(pickle_name):
    cls = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
    cls.fit(X_train, y_train, feat_type=feat_type)
    with open(pickle_name, 'wb') as fh:
        pickle.dump(cls, fh)
else:
    with open(pickle_name, 'rb') as fh:
        cls = pickle.load(fh)

askl_ensemble = sklearn.ensemble.VotingClassifier(estimators=None, voting='soft')
weights = []
models = []
for weight, identifier in zip(list(cls.automl_.ensemble_.weights_),
                              list(cls.automl_.ensemble_.identifiers_)):
    if weight == 0.0:
        continue
    weights.append(weight)
    try:
        models.append(cls.automl_.models_[identifier])
    except KeyError:
        print(cls.automl_.ensemble_)
        print(cls.automl_.ensemble_.identifiers_)
        print(cls.automl_.models_)
        raise

askl_ensemble.estimators = models
askl_ensemble.estimators_ = models
askl_ensemble.weights = weights
askl_ensemble.le_ = sklearn.preprocessing.LabelEncoder().fit(y_train)
askl_ensemble.classes_ = askl_ensemble.le_.classes_

#print(askl_ensemble.predict(X_test))
#print(cls.predict(X_test))

#print(askl_ensemble.__repr__(N_CHAR_MAX=100000))

def extract_sklearn_object(obj):
    if isinstance(obj, sklearn.ensemble.VotingClassifier):
        estimators = [extract_sklearn_object(estimator) for estimator in obj.estimators_]
        obj.estimators = estimators
        obj.estimators_ = estimators
        return obj
    elif isinstance(obj, autosklearn.pipeline.base.BasePipeline):
        steps = []
        for name, step in obj.steps:
            steps.append((name, extract_sklearn_object(step)))
        return sklearn.pipeline.Pipeline(
            steps=steps,
            memory=obj.memory,
            verbose=obj.verbose,
        )
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.data_preprocessing.DataPreprocessor):
        # TODO Make the auto-sklearn object an actual column transformer or make it a learnable
        #  attribute column_transformer_
        column_transformer = obj.column_transformer
        transformers = []
        for name, trans, column in column_transformer.transformers_:
            transformers.append((name, extract_sklearn_object(trans), column))
        column_transformer = sklearn.compose.ColumnTransformer(
            transformers=transformers,
            remainder=column_transformer.remainder,
            sparse_threshold=column_transformer.sparse_threshold,
            n_jobs=column_transformer.n_jobs,
            transformer_weights=column_transformer.transformer_weights,
            verbose=column_transformer.verbose,
        )
        return column_transformer
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnChoice):
        # TODO make choice a fit-recognizing attribute: obj.choice_
        return extract_sklearn_object(obj.choice)
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.balancing.balancing.Balancing):
        # TODO implement the actual behavior of weighting!!!
        return 'passthrough'
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm):
        # TODO make preprocessor preprocessor_
        return obj.preprocessor
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm):
        return obj.estimator
    else:
        raise TypeError(type(obj))

def verify_only_sklearn_objects(obj):
    if (
        obj is None
        or isinstance(obj, (int, float, str))
        or isinstance(obj, types.FunctionType)
        or isinstance(obj, (np.random.RandomState, np.int32, np.int64, np.uint32, np.uint64,
                            np.void, np.float64, np.bool_))
        # Identity checks avoid an elementwise comparison when obj is an array.
        or obj is np.float64
        or obj is np.bool_
    ):
        return
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        pass
    elif obj.__class__.__module__.startswith('sklearn.'):
        pass
    elif obj.__class__.__module__.startswith('autosklearn.pipeline.implementations.'):
        pass
    else:
        raise TypeError((type(obj), obj))

    if hasattr(obj, '__dict__'):
        for key in vars(obj):
            verify_only_sklearn_objects(vars(obj)[key])
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        for entry in obj:
            verify_only_sklearn_objects(entry)
    elif obj.__class__.__module__.startswith('sklearn.'):
        # These are private sklearn objects
        pass
    else:
        raise TypeError((type(obj), obj))

# TODO what about the stuff from validation.py that's done prior to fitting?
# TODO add necessary imports! - also add the full class names
# TODO what about the random states? Set them as integers in auto-sklearn to be reproducible?
# TODO Improve the printing to be more readable
# TODO add a few tests that the export is done correctly
extracted_model = extract_sklearn_object(askl_ensemble)
verify_only_sklearn_objects(extracted_model)
print(extracted_model.__repr__(N_CHAR_MAX=1000000))

Most importantly, I think every component should by itself know how to convert itself to a scikit-learn object instead of having all this information in a central function as shown here.
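A minimal sketch of that idea, using stand-in classes rather than the real Auto-sklearn ones. The class names `LeafComponent`, `ChoiceComponent` and `PipelineComponent` are hypothetical, and the method name `to_sklearn` is only an assumed name for such a hook; the strings stand in for fitted scikit-learn objects:

```python
class LeafComponent:
    """Stand-in for a wrapper that holds exactly one fitted estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def to_sklearn(self):
        # A leaf simply hands out the object it wraps.
        return self.estimator


class ChoiceComponent:
    """Stand-in for a choice node that delegates to the selected component."""

    def __init__(self, choice):
        self.choice = choice

    def to_sklearn(self):
        return self.choice.to_sklearn()


class PipelineComponent:
    """Stand-in for a pipeline: converts each step by asking the step itself."""

    def __init__(self, steps):
        self.steps = steps

    def to_sklearn(self):
        # A real version would wrap this list in sklearn.pipeline.Pipeline.
        return [(name, step.to_sklearn()) for name, step in self.steps]


pipeline = PipelineComponent([
    ("rescaling", ChoiceComponent(LeafComponent("fitted scaler"))),
    ("classifier", ChoiceComponent(LeafComponent("fitted classifier"))),
])
exported = pipeline.to_sklearn()
print(exported)
```

Compared with the central `extract_sklearn_object` function, no `isinstance` chain is needed: adding a new component type only requires that type to implement its own conversion.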

mfeurer commented 2 years ago

Hey @GeorgePearse did you already get started? If not, I could have a look on Wednesday morning.

GeorgePearse commented 2 years ago

Hi @mfeurer sorry for the radio silence. You go for it, didn't really get anywhere. Looking forward to seeing the implementation though.

mfeurer commented 2 years ago

No worries, there's now a draft in #1375.

mereldawu commented 2 years ago

Hi @mfeurer, this feature is quite useful for us, as we'd ultimately like to use kserve to serve the autosklearn models. I took a look at the draft: will only the best model be considered, or will there be a way to export the other models found during the trial?

mfeurer commented 2 years ago

Yes and no. This will add functionality to the class AutoSklearnClassifier that exports only the models that are part of the ensemble. It will also add an export function to each individual model. As @eddiebergman pointed out in #1376, he is working on a function to easily access all models stored on disk, so it will be possible.

roch-gla commented 2 years ago

Hi @mfeurer, thank you for working on this issue! This will allow us to leverage the power of autosklearn in production pipelines that are implemented for sklearn pipelines. Looking forward to using this feature.

xieleo5 commented 1 year ago

Hi @mfeurer, how can we use the function to_sklearn() in the latest version? I can't find this function inside AutoSklearnClassifier now.

kunjshukla commented 1 year ago

Hey, I would like to contribute to this issue. Please assign this to me.

DPRASAD-dp commented 1 day ago

Is someone working on this? If not, I would like to contribute.