What's in store for Auto-Sklearn? -- From the Developers

eddiebergman commented 1 year ago

What's going on?

Auto-Sklearn has recently been under-maintained, we appreciate that this has caused many users to face dependency issues as pinned dependencies slowly start going out of data. While we support this project primarily through academic means, we are still proud of the community that has formed around it and are dedicated to push it forward.

Will Auto-Sklearn still be maintained?

Yes, auto-sklearn will be maintained and updated moving forward! We initially tried some of these updates, e.g. #1611, #1618 but there were larger issues at play. To alleviate this, we are currently working on a major refactor of the tool, introducing more flexibility and long-wanted features, including pipeline export, flexible pipelines, and a modular design. We expect the first prototype will be available within the next 1-2 months.

Why the refactor?

Auto-Sklearn was initially built during Python 2 and during the eariler days of scikit-learn. Machine learning libraries and their eco-system were still developing and a lot has changed since then. There were also a lot of lessons learned which while easy in concept, truly difficult to integrate into the current design.

Doing research with Auto-Sklearn has also become harder. By becoming a robust and well-performing tool, this has made performing novel research with Auto-Sklearn more difficult.

What to expect?

... Not that much, it's a refactor to get back to where we were but with the goal to make it more extensible.

We will still maintain the front facing AutoSklearnClassifier and AutoSklearnRegressor, to act primarily as it did before and staying very scikit-learn like with it's simple interface.

This refactor will allow us to solve some long standing issues that have arose. We looked through all the issues and tried to categorize what this new refactor will enable. Not all of these issues will be solved upon release but they will provide a tangible rode towards these.

We will have a new flexible scheduling system, allowing users to hook into events as they happen, hopefully handling issues like:
- 1236
- 1624
- 1569
- 1522
- 986
- 397
A more flexible pipeline definition, allowing you to create your own or just modify the default, solving:
- 1661
- 578
- 1110
- 1587
- 1548
- 1429
- 1268
- 577
- 1150
- 379
Auto-Sklearn will allow you to optimize your own custom sklearn pipelines and try it's darn best to return you pure functioning sklearn pipelines (no auto-sklearn custom parts attached). This means you will be able to run any library that supports sklearn pipelines. This should allow great strides towards:
- 388
- 1006
- 1633
- 1667
- 1663
- 1641
- 1272
- 1640
- 1634
- 1607
- 1597
- 1600
- 1467
- 1102
- 1448
- 786
By refactoring, we can also use newer features of sklearn, that previously we tried to bolt in, but was never a first class citizen at the time of Auto-Sklearn's conception.
- 288
- 1615
- 1614
- 1613
- 1596
- 1494
- 1334

What can I do?

Please let us know what you think and what you'd like to see from this rebuild!

AmirAlavi commented 1 year ago

I think that such a refactor would be very beneficial, and make auto-sklearn even more useful. Are there opportunities to contribute to this refactor?

eddiebergman commented 1 year ago

Hi @AmirAlavi, unfortunately not at this time but once we publish there will be definitely be a lot of opportunities to contribute! Thanks very much for offering, we appreciate it :)

manuelinfosec commented 1 year ago

Hi, @eddiebergman. I and my team are really looking forward to AutoML integration to ONNX formats. We've been really excited working with AutoML do far, and can wait to take interoperability to another level.

eddiebergman commented 1 year ago

Hello, brief update:

We're in the process of doing some benchmarking, namely to ensure we can still handle most datasets. Progress after this will be to re-enable some more performance features of the original auto-sklearn, namely iterative fitting, meta-learning and AutoSklearn2.

@manuelinfosec I am unfamiliar with ONNX, would you mind be able to help me with a few questions?

Does ONNX work with full sklearn pipelines that are trained on dataframes and output dataframes?
- Notably the input given to pipelines will now be pandas and we use pipeline.set_output("pandas") which enables pandas for intermediate and final output. The ONNX tutorial seems to suggest numerical dtypes as input, however any encoders (e.g. OrdinalEncoder) would expect a "str" or "categorical" dtype as they were trained on. Is this still possible or does ONNX have some strict restrictions on the setup of how models must be trained?
I see VotingClassifier and VotingRegressor is supported, that's how the final ensemble is returned to the user (new and old). However we have to use a hack to make this work, namely setting the private estimators_ attribute scikit-learn/scikit-learn#7382 and some other necessities. Would this cause an issue with ONNX?
I see not all estimators/transformers are supported. Does this selection also extend to estimator parameters? Is there any automatic way within ONNX to automatically know if a model is supported?

eddiebergman commented 1 year ago

Update on ONNX:

I spent some time with sklearn-onnx to convert some of the pure-sklearn pipelines that auto-sklearn will output (no autosklearn code in there). Most of these pipeline will essentially be a VotingClassifier or VotingRegressor with several Pipeline's inside them. This is how we can handle ensembles and CV fold models.

However this doesn't seem to be supported by sklearn-onnx as documented in onnx/sklearn-onnx#1016. As the VotingX with multiple datatypes is our main component for combining estimators, I am not sure how we can make ONNX compatible pipelines.

If anyone has more information on something I could be missing, please let me know either in the above mentioned issue or here!

AmirAlavi commented 1 year ago

@eddiebergman something I'd like to see from this refactor is the ability to specify an optional max_success_trials parameter (maximum number of successful candidate algorithms tried) to the constructors for AutoSklearn estimators for early stopping. I don't think this is currently possible from the get_trials_callback parameter, since the result: RunValue argument of that callback only lists num_run in the additional_info, which I think is the counter of total candidates tried, whether they succeeded or not.

Or rather than adding that special param to the constructor, giving the aforementioned callback hook access to the runhistory?

AmirAlavi commented 1 year ago

@eddiebergman something else we noticed was the extremely long refit time for AutoSklearn models. For example, if we set the time_left_for_this_task to be just 1hr, and so the per_run_time_limit is just 6minutes, we noticed that the refit could somehow take hours in some cases.

We didn't investigate it thoroughly, but I think I recall discovering that the HistGradientBoosting models would take a very long time to fit. I know ensembling adds another layer to this, but I think we had observed this even for ensemble_size=1.

I'm wondering if you had also noticed any performance issues with that, and if the new updates address it? (perhaps the upgrade to newer sklearn takes care of it)

eddiebergman commented 1 year ago

Note: If you appreciated this longer form insight into the progress and new design, please give an emoji response, otherwise we can just stick to short form responses :) Please feel free to ask about other topics and I can write up a response of the new underlying API and how things will work.

Hi @AmirAlavi,

Regarding point 1. Callbacks are how the new autosklearn is mostly built, i.e. that's how most of the control flow now works (with some additional extra features ;) ). However to keep things simple for the end user, the AutoSklearnEstimator is following the sklearn api, where most of the functionality happens in fit(...). There is no exposure to these callbacks and customization that doesn't fit naively into init args are mostly avoided. Worth restating, the estimator API will remain mostly identical, along with it's feature set

To this end, we have a very simple argument n_trials= to the Estimator classes which stops after a certain number of trials are done (not necessarily successfully).

However this fit(...) call creates an AutoSklearn object which is more traditional in it's setup and usage. If you'd like full control, then you'd have to create one of these AutoSklearn objects, following the Estimator as a guide for how to do so.

Tasks and Plugins

To give you some idea of how this works and how we tried to increase the surface area of interaction, we now follow a more "server" like control flow, i.e. event-driven. The bit I'll share for today is namely the notion of a Scheduler, Task and a TaskPlugin.

Scheduler: A event-loop based controller using async/await to prevent the pit-falls of while True event loop. You do not need to know Python's async to use it, but the functionality is there for advanced use cases (i.e. as a server). Can use anything that looks like a python Executor to submit compute and return a Future.
- ProcessPoolExecutor (python's native local processes)
- Dask
- Loky
Task: A unit of compute to run on some Scheduler with associated events. Can be extended for more context-dependent events.
- Evaluating Trials
- Ensembling
- Pruning non-performant trials
TaskPlugin: Extends/modifies the runtime of a Task (parameters/setup/etc...)
- Pynisher to limit resource usage
- ThreadpoolCtl limit oversubscription to threads in sklearn.
- CallLimiter control if a task gets submitted by call count, concurrent count or conflicting tasks (e.g. don't prune models from disk while ensembling)

With this context, here is 2 examples of how you could accomplish your goal of stopping after a number of successful trials, the first focusing on the notion of a Task, while the second on TaskPlugins, let me know if you have feedback!

askl = AutoSklearn(...)
task = askl.trial_task

# Option 1

@task.on_success(when=lambda: task.on_success.count >= 10)  # optional `when=`
def stop_autosklearn(report: Trial.Report) -> None:
    askl.stop()

# Option 2
task.on_success(
    stop_autosklearn,
    when=lambda: task.on_success.count >= 10,
)

# This `.count` property doesn't exist but I'll add it, thanks for the illuminating question!
# has other events such as `on_{cancelled/failed/crashed/memout/timeout/...}

askl.run(...)

There are also "plugins" which modify behaviors of Tasks, the use-case agnostic unit of compute. This example goes a bit deeper, trying to show both the events system, using the Emitter and also the usage plugin

from typing import ParamSpec, TypeVar
from datetime import datetime
from ... import TaskPlugin, Task, Emitter, Event

P = ParamSpec("P")
R = TypeVar("R")

class MyPlugin(Emitter, TaskPlugin):
    """A TaskPlugin interacts with a submission of a Task before
    it hits the Scheduler. This can be used to modify the function or its
    arguments, as well as getting the highest priority in terms of responding
    to events emitted from the task.
    """

    name = "my-plugin"
    """Name of the plugin for logging purposes"""

    SUCCESS_LIMIT_REACHED: Event[str] = Event("an-event-name")
    """Will emit the current time when the limit is reached"""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.n = n
        self.count = 0

        # These are how the callbacks are created
        # by using the `Event` defined above, it enables type-safety
        # for anyone using the `.on_reached` attribute to register
        # callbacks
        self.on_reached = self.subscriber(self.SUCCESS_LIMIT_REACHED)

        # The typing is like a `Callable` if familiar (not required)
        self.task: Task[..., Trial.Report] | None = None

    def pre_submit(
        self,
        fn: Callable[P, R],
        *args: P.args,
        **kwargs: P.kwargs,
    ) -> tuple[Callable[P, R], tuple, dict] | None:
        # TaskPlugin can modify the function and args before
        # it's submitted to the Scheduler
        if self.count >= self.n:
            return None  # Scheduler will not submit anything

        return fn, args, kwargs  # Submit as normal

    def attach_task(self, task: Task[..., Trial.Report]) -> None:
        self.task = task
        task.on_returned(self._check_to_stop) # This `.on_returned` created the same way as above

    def _check_to_stop(self, report: Trial.Report):
        assert self.task is not None

        if report.status is Trial.Status.Success:
            # Or just if report.status == "success"          
            self.count += 1
            if self.count >= self.n:
                 time_stamp = datetime.now().isoformat()
                 self.on_reached.emit(time_stamp)

myplugin = MyPlugin()

# Only now are we really in the context of AutoSklearn
askl = AutoSklearn(..., plugins=[myplugin])

@myplugin.on_reached
def stop_askl(timestamp: str) -> None:
    askl.stop()
    print(f"askl_stopped at {timestamp} after {myplugin.n} successes")

These plugins are how a lot of the additional, optional, functionality is given to autosklearn tasks, such as memory limiting and call limiting.

Sorry for the long response to what's a rather simple question, I wanted to share a little bit of how the internals work so that you can give any feedback or raise any other questions. I unfortunately can still not share any source code.

Regarding question 2. I don't think we ever noticed this so thanks for bringing it to our attention. Could you raise a separate issue so that we have a note to investigate this? Unfortunately it's a little more complicated than just an ensemble. AutoSklearn works by ensembling all fold models from cv-folds and then also creates a weighted ensemble on top this. For example, with 5 fold CV and an ensemble which contains 3 models, this will be the weighted probabilities of 3 ensembles of 5 models, i.e. all 15 train sklearn models will be used. This still wouldn't fully explain the fitting times but just for some extra info.

AmirAlavi commented 1 year ago

@eddiebergman Thank you for the longform response! It was very useful and informative.

I like the new API from what it sounds like (though I don't think i have my head fully wrapped around it, since i'm not familiar with python's concurrent package). To confirm/summarize:

If I really like the built-in, opinionated pipeline that AutoSklearn uses, and I really like the SMAC algorithm and greedy ensemble building, I'll still be able to create an AutoSklearnEstimator as before
If I want to dig in more, I can follow the example of AutoSklearnEstimator (or probably some tutorial/user guide) to interact with AutoSklearn directly. So now, for full control, I won't have to create my own Estimator classes and copy a lot of boilerplate.

Will we be able to specify the bayesian optimization algorithm as well? for example, what if rather than SMAC (which I believe uses a regression tree as the posterior probability model, and also flips a coin every time to determine whether to listen to the posterior model or totally randomly explore), I'd like to try another algorithm? Will the new API have a "registry" of such algorithms already defined? or maybe define the API under which we can add our own?

AmirAlavi commented 1 year ago

@eddiebergman Regarding point two, sure I'll make an issue so there's a record of it, but not sure i'll get to a thorough investigation myself.

And thank you! I wasn't aware of that detail of using all models from the CVs in the ensembling, and that could certainly partially explain why even when setting ensemble size=1, the refit would still take long (since i was using 10-fold CV)! That's really good to know going forward.

AmirAlavi commented 1 year ago

@eddiebergman I have another feature request/consideration for you :)

A pain point currently is making sure we're doing a proper cross-validation scheme.

Because we're doing algorithm selection and hyperparameter tuning (querying many candidate models), I think we need to do something like nested CV. The current cv argument is thus the "inner cv". But that leaves the responsibility of the "outer cv" to the user. So the straighforward solution of wrapping your autoML script in a loop is cumbersome and seems inneficient.

Concretely, you might do this to start with:

X_train, X_test, y_train, y_test = train_test_split(X, y)

cv = StratifiedKFold(n_splits=5)
model = AutoSklearnClassifier(..., cv=cv, ...)
# "model selection" and "hyperparameter tuning" done here,
# but askl will do cv to do this selection, and so no model will
# see the whole X_train
model.fit(X_train, y_train)

# now that we've fixed our model, fit it on the whole train set
model.refit(X_train, y_train)
y_pred = model.predict(X_test)
log_to_experiment_tracker(accuracy_score(y_test, y_pred))

But you'll quickly realize you'll want to do an outer loop of CV rather than just a single train test split, so that you can get a distribution to estimate your generalization error, rather than just a point estimate.

One could call this "outer cv" splitter yourself outside the context of askl, but that would require you calling it in an expensive, slow, for loop; you could parallelize this, but you're on your own for that logic.

Can we make it so that askl can do the outer loop for you as well? Since it's already taking care of running jobs asynchronously and dispatching them to some compute backend.

Aditi840 commented 1 year ago

Hello, is there something here I can contribute to

AmirAlavi commented 1 year ago

@eddiebergman I have another feature request/consideration for you :)

I have a counterpoint to my own suggestion.

In the case when you have fixed resources, you have two options:

Run 5 automl jobs in parallel, each with a fraction of the resources
Run 5 automl jobs serially, each getting all of the resources exclusively

With (1), for each run to try N total algorithms, you may have to run them longer With (2), you can increase n_jobs and get through N total algorithms in shorter time.

Concretely, if machine has 10 cores, 100GB, to do 5 fold outer CV:

Run each in parallel, give each 2 cores, and 10gb memory per core (so each uses 20Gb), for 10hours (total runtime 10hrs)
Run each separately, give each 10 cores, and 10gb memory per core (each uses 100Gb memory), for 2 hrs each (total runtime 10hrs)

The total time is the same between these two. @eddiebergman is it correct to say that each automl job would also see the same number of algorithms (roughly)?

In other words: if you have fixed resources, one could argue that the outer CV loop not being parellelized is ok.

If you have scalable resources, then there's a case for the feature request under consideration

eddiebergman commented 1 year ago

Hi @AmirAlavi,

If I really like the built-in, opinionated pipeline that AutoSklearn uses, and I really like the SMAC algorithm and greedy ensemble building, I'll still be able to create an AutoSklearnEstimator as before

Yup, the estimator is still the opinionated fit() and predict() with some flexibility but not so much. The AutoSklearn class is less opinionated but vastly more flexible if you want to do HPO around sklearn.

Will we be able to specify the bayesian optimization algorithm as well?

The actual SMAC related code in AutoSklearn has now been reduced to be about 10 lines :) While theoretically other optimizers are possible, the main limitation is whether an optimizer supports a space with conditional hyperparameters. It's possible to use other optimizers which don't take this into account but this would lead to the optimizer getting confused if say for example, it chose the model to evaluate as an SVM and RandomForest::n_estimators=10. It does not know that it should not select this hyperparameter, that it has no impact on the results and then learn some faulty relationships between algorithm = SVM and RandomForest::n_estimators. SMAC is one of the few optimizers that handles this natively with a "static search space". See Optuna's define-by-run for an example of a dynamic search space. The notable difference here is you need custom code that defines your search space, something we can not have while allowing flexible pipeline structures.

Extra details we can share

The reason we need a "static search space" is that you can statically define your pipelines. This is what will enable the new autosklearn to optimize your own sklearn pipelines, not just our opinionated ones. Of course by default, it will use our opinionated one.

from ... import Pipeline, step, choice
from ConfigSpace import Float
from sklearn.pipeline import Pipeline as SklearnPipeline

pipeline = Pipeline.create(
    step("imputer", SimpleImputer, space={"strategy": ["mean", "median"]),
    choice(
        "estimator",
        step("rf", RandomForestClassifier, space={"n_estimators": (1, 10)}),
        step("svc", SVC, space={"C": Float("C" (0.0, 10.0), log=True)})
    )
)

# Pass in something that parses out a static space with `parser=...`
# or it will automatically try to find a suitable one.
# In this case, only the ConfigSpaceParser will now how to deal
# with the `ConfigSpace.Float` parameter in the space and you
# will get back a ConfigSpace
space = pipeline.space() 

# Likewise for `sample=...`. This will automatically use the `ConfigSpaceSampler` as `isisntance(space, ConfigSpace)`
config = pipeline.sample(space)

askl = AutoSklearnClassifier(pipeline=pipeline)
askl.fit(x, y)

# Note here that it gives back a pure sklearn object, no autosklearn
# objects are inside
best: SklearnPipeline = askl.best_

But you'll quickly realize you'll want to do an outer loop of CV rather than just a single train test split, so that you can get a distribution to estimate your generalization error, rather than just a point estimate.

Yup that's a fair point. While having this train-val-test split is usually sufficient, it could definitely be an issue in highly imbalanced datasets. I'm not sure we can handle this outer-cv ourselves very naively and would add a lot of complication. I do not think we will support this.

The total time is the same between these two. @eddiebergman is it correct to say that each automl job would also see the same number of algorithms (roughly)?

Roughly yes, that's correct but I have no concrete numbers to give you there.

If you have scalable resources, then there's a case for the feature request under consideration

In terms of enabling this functionality with scalable resources, we do have something like this now, where you can create a 10 core Scheduler but have autosklearn only use 2 cores of the scheduler.

from ... import Scheduler
from autosklearn import AutoSklearnClassifier

total_cores = 10
cores_per_askl = 2
scheduler = Scheduler.with_processes(total_cores)

xs, ys = ...
askls = [
    AutoSklearnClassifier(..., scheduler=scheduler, njobs=cores_per_askl) for _ in range(5)
]

for (x, y, askl) in zip(xs, ys, askls):
    askl.fit(x, y)

While not fully tested, we should also give capabilities to run on other kinds of resources:

from ... import Scheduler
from dask import ...

# Pass in an Executor
# https://docs.python.org/3/library/concurrent.futures.html#executor-objects
client = dask.x.y.z(...)
scheduler = Scheduler(executor=client.get_executor())

# Some native support (using dask-jobqueue)
scheduler = Scheduler.with_slurm(...)

# Using sklearns Loky Backend parallelism
scheduler = Scheduler.with_loky(...)

This will be considered use at your own risk and will rely on user contributions to ensure stability, we can not fully test with all possible backends. Using remote resources with a seperate file-system is low priority and untested (we can't afford AWS for testing) but theoretically possible.

eddiebergman commented 1 year ago

Hello, is there something here I can contribute to

Hi @Aditi840

We are not currently accepting contributions until this re-work is done but we appreciate the offer!

Aditi840 commented 1 year ago

@eddiebergman, fine

whoisltd commented 11 months ago

hi @eddiebergman did u have any comment about this issue https://github.com/automl/auto-sklearn/issues/1695? Why autosklearn cant stop training?, i think number cpu core, thread and process avaiable or something like these is the problem

hguturu commented 9 months ago

Hi @eddiebergman, any new updates on the anticipated timeline for the auto-sklearn update? Leaving behind scikit-learn < 0.25 would be very nice to have.

eddiebergman commented 9 months ago

We have a working prototype we can't share publicly yet, but in the meantime check out amltk. There's some new techniques towards optimization we want to try with auto-sklearn to solve some existing issues which is taking some time.

BradKML commented 6 months ago

@eddiebergman thanks for all these work! what other AutoML tools would you vouch for? (e.g. PyCaret, TPOT, H2O, EvalML)

eddiebergman commented 6 months ago

I highly recommend AutoGluon but you could also refer to methods properly evaluated on the AutoML Benchmark

automl / auto-sklearn