automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

AutoMLs Benchmark. Why is Auto-sklearn so bad? #923

Closed Alex-Lekov closed 2 years ago

Alex-Lekov commented 3 years ago

This is not a criticism - I really want to understand.

I made a benchmark of AutoML libraries, and Auto-sklearn showed very poor results, even worse than plain CatBoost with default parameters! https://github.com/Alex-Lekov/AutoML-Benchmark/ I run the benchmark in Docker, so you can easily reproduce it.

here is the code from the benchmark:

import autosklearn.classification
import autosklearn.metrics

# TIME_LIMIT and RANDOM_SEED come from the benchmark configuration.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=TIME_LIMIT,
    metric=autosklearn.metrics.roc_auc,
    seed=RANDOM_SEED,
)

automl.fit(X_train, y_train)
predictions = automl.predict_proba(X_test)

Is the code correct? (I do not adjust the advanced parameters, since AutoML should, in theory, pick everything up by itself; that's why it is AutoML.)

I also tried AutoSklearn2Classifier, but it constantly crashes with various errors.

Can you tell me what I am doing wrong? Or is this a real result and the library really is that bad?

mfeurer commented 3 years ago

Thanks for bringing this up. We'll definitely have a look into this, but we're currently a bit swamped so it might take a bit for a definite answer.

I just had a very brief look at your code and found the following things:

We usually work with the following AutoML benchmark: https://openml.github.io/automlbenchmark/ . Could you please let us know how your benchmark differs from it? Since your results show that LightGBM and CatBoost are strong competitors, we'll probably add them to the OpenML AutoML benchmark.

Alex-Lekov commented 3 years ago

Thanks for the answer.

  1. It's nice that you continue to work on improving the library. I think it's really important for AutoML to be able to work with raw datasets.

  2. I tried to pass the feature types through without preprocessing, but the test crashed on some of the datasets, so I had to do the preprocessing myself instead of leaving it to the library.

  3. Regarding the question about this except: https://github.com/Alex-Lekov/AutoML-Benchmark/blob/master/binary-classification/frameworks/autosklearn/model.py#L64 ("It looks suspicious; if this is necessary, we haven't had any issues here ourselves"): sometimes during optimization the winning algorithm did not support predict_proba (which is very strange), so I had to implement that construction (see the sketch after this list).

  4. My benchmark is different in that I chose datasets where the problem has not already been solved at 0.99 AUC, and where each dataset has more than 1000 examples (to reduce the influence of randomness).

  5. Perhaps if you added LightGBM and CatBoost to Auto-sklearn, the results would be much better.
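For context, the construction described in point 3 might look roughly like the following. This is only an illustrative helper, not the benchmark's actual code, and the one-hot fallback (assuming integer class labels starting at 0) is just one possible way to handle a model without predict_proba:

import numpy as np

def safe_predict_proba(model, X):
    # Try to get class probabilities; if the fitted model does not expose
    # predict_proba, fall back to hard predictions turned into degenerate
    # one-hot "probabilities".
    try:
        return model.predict_proba(X)
    except AttributeError:
        preds = np.asarray(model.predict(X)).astype(int)
        proba = np.zeros((len(preds), preds.max() + 1))
        proba[np.arange(len(preds)), preds] = 1.0
        return proba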

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

eddiebergman commented 2 years ago

I'm closing this as it's very outdated and a lot has changed by now. I don't see a good reason to keep it open.

simonprovost commented 2 years ago

Is there any explanation as to why it produces poorer results than practically every other framework in the benchmark? Knowing that Auto-sklearn has won numerous Kaggle championships, etc., that perplexed me, @eddiebergman @mfeurer.

@Alex-Lekov, may we ask you to try the latest 0.14.7 version and see whether the results improve? That might allow this GitHub issue to be resolved right after.

Cheers

eddiebergman commented 2 years ago

For sure, feel free to benchmark it however you wish :) We still use the automl-benchmark as our go-to set of 39 datasets to test on. We can't speak for the validity of custom benchmarks, but we use this one because it has been peer-reviewed and is often used for comparison.

I took a look at the source code and the setup for AutoSklearn and the others. There seems to be no control over CPU allocation: by default, AutoGluon uses as many cores as it thinks it needs, while AutoSklearn uses just one. It also seems that AutoML_Alex is not limited by any optimization time budget, which leads to unfair conditions between the frameworks. I believe the validity of the setup would have benefited greatly from reaching out to the authors of the various frameworks beforehand to ensure that each framework is compared under equal conditions.
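As a minimal sketch of what pinning resources explicitly could look like when comparing frameworks (the concrete budget, core count, and memory limit here are illustrative assumptions, not the benchmark's actual settings):

import autosklearn.classification
import autosklearn.metrics

# Illustrative values; a fair comparison would use the same wall-clock budget
# and core count for every framework in the benchmark.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,   # one-hour wall-clock budget
    n_jobs=4,                       # run four workers instead of the default single core
    memory_limit=3072,              # memory limit per worker, in MB
    metric=autosklearn.metrics.roc_auc,
    seed=42,
)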

I took a look at the tabular results as well, and it seems most frameworks are basically equal to within 0.01-0.02, meaning the ranking is a lot more susceptible to noise. As also pointed out, CatBoost/LightGBM are included in AutoML_Alex, and they are simply among the best tabular learners by default, especially for smaller datasets, which leads to them usually being at the top. If you were to plot a mean normalized AUC score over all datasets, I imagine you might see a slightly different picture.
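For illustration, a mean normalized AUC over datasets could be computed along these lines; this sketch assumes the per-dataset AUCs are already collected in a hypothetical results table with one row per dataset and one column per framework (the values below are made up):

import pandas as pd

# Hypothetical results table: rows = datasets, columns = frameworks, values = AUC.
results = pd.DataFrame(
    {"autosklearn": [0.91, 0.78, 0.85], "catboost": [0.92, 0.77, 0.86]},
    index=["dataset_a", "dataset_b", "dataset_c"],
)

# Min-max normalize each dataset's row so every dataset contributes equally,
# then average over datasets to get one score per framework.
lo = results.min(axis=1)
hi = results.max(axis=1)
normalized = results.sub(lo, axis=0).div((hi - lo).replace(0, 1), axis=0)
mean_normalized = normalized.mean(axis=0)
print(mean_normalized.sort_values(ascending=False))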

It is interesting that auto-sklearn failed on the amazon dataset, but we have no idea why; whatever the reason, I would hope it is fixed by now.

If you have other ideas as to why it might have performed so badly by their ranking system, we would love some insight.

Best, Eddie

simonprovost commented 2 years ago

That is clearly a comprehensive examination, @eddiebergman. Many thanks for your prompt and thorough answer!! 🥳

I would say that the peer-reviewed automl-benchmark paper and methodology are significantly superior to AutoML_Alex in terms of fair comparison and reliability, as you also mentioned. Nevertheless, it was intriguing to see AutoML_Alex bring this up. However, when I attempted to run that system myself, nothing worked well, so I doubt it is a good comparative benchmark, based on the various criteria you laid out.

On the other hand, in my current use case, I was doing lengthy Auto-sklearn runs in the cloud and getting consistently bad results (no better than random), whereas with AutoGluon a single, brief run gave results far better than random. This caught my eye because Auto-sklearn has won multiple competitions and its creators (including you) continue to work on it actively! It turned out to be caused by my Auto-sklearn setup not requesting the appropriate metric to optimise for! This may also explain why it was poorly represented in the AutoML_Alex benchmark. Auto-sklearn seems to place greater emphasis on selecting the right optimisation metric than the majority of other AutoML frameworks I have examined (I am still a beginner with it). (Note: I do not have much time to dig into the AutoML_Alex code, but I am curious which metrics the author chose to optimise on with Auto-sklearn, and whether the same metrics were used for every framework in the benchmark.)

Regardless, I believe I now have the correct metric to ask my Auto-sklearn pipeline to optimise for. I will still post a new question that you may be able to answer easily, since I was curious about something on this particular subject (i.e., the metric the AutoML system chooses to optimise for). See you on the other side.

Thanks once more for your investigation, @eddiebergman!

Wonderful day☀️

eddiebergman commented 2 years ago

Regarding the metric: we don't optimize for the best metric. We use sklearn's type_of_target function to determine what kind of target it is and then just select a default, so I'm not sure why your implementation would vary so wildly. However, the fact that it got worse than random seems really odd, and if you had code to share I could point you in the right direction, or it might at least reveal to us something that is broken.
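A small sketch of the two pieces mentioned here: sklearn's type_of_target for inspecting the target, and passing an explicit metric to auto-sklearn instead of relying on the default. The data (X_train, y_train), the time budget, and the choice of balanced_accuracy are illustrative assumptions:

from sklearn.utils.multiclass import type_of_target

import autosklearn.classification
import autosklearn.metrics

# Inspect what kind of target sklearn thinks this is ('binary', 'multiclass', ...).
print(type_of_target(y_train))

# Pass the metric you actually care about instead of relying on the default.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=1800,
    metric=autosklearn.metrics.balanced_accuracy,  # e.g. for imbalanced classes
)
automl.fit(X_train, y_train)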

Best, Eddie!

simonprovost commented 2 years ago

Now it makes more sense! Depending on the metric or the type of target, the results can change anyway, since different metrics pursue different goals, but as you said, it is odd that the change would be this large. Perhaps my data is quite intricate, I do not know. Let's see how our other discussion on the metric goes, and I will provide you with further information on the code so that you can verify it.

Thanks so much! Cheers,

BradKML commented 1 year ago

In that case, @Alex-Lekov, would you consider a retest of the AutoML benchmark, since there have been so many advancements in gradient-boosting-style models? It is already possible to hot-plug the relevant algorithms into Auto-sklearn (roughly along the lines of the sketch below), and auto_ml already has third-party support for them and is ready to be re-tested (even though it is already deprecated).
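For reference, a minimal sketch of what hot-plugging an extra learner into Auto-sklearn could look like, following the pattern of the documented component-extension examples. The LightGBM wrapper, its class name, and the hyperparameter ranges here are illustrative assumptions, and lightgbm must be installed:

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)

import autosklearn.pipeline.components.classification
from autosklearn.pipeline.components.base import AutoSklearnClassificationAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, PREDICTIONS


class LightGBMClassifier(AutoSklearnClassificationAlgorithm):
    # Illustrative wrapper around lightgbm.LGBMClassifier.
    def __init__(self, n_estimators=100, learning_rate=0.1, random_state=None):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.random_state = random_state
        self.estimator = None

    def fit(self, X, y):
        import lightgbm as lgb

        self.estimator = lgb.LGBMClassifier(
            n_estimators=int(self.n_estimators),
            learning_rate=float(self.learning_rate),
            random_state=self.random_state,
        )
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "LGBM",
            "name": "LightGBM Classifier",
            "handles_regression": False,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": False,
            "handles_multioutput": False,
            "is_deterministic": False,
            "input": (DENSE, SIGNED_DATA, UNSIGNED_DATA),
            "output": (PREDICTIONS,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        # Illustrative search space; real ranges would need tuning.
        cs = ConfigurationSpace()
        cs.add_hyperparameters([
            UniformIntegerHyperparameter("n_estimators", 50, 500, default_value=100),
            UniformFloatHyperparameter("learning_rate", 0.01, 0.3, default_value=0.1, log=True),
        ])
        return cs


# Register the component so auto-sklearn can include it in its search space.
autosklearn.pipeline.components.classification.add_classifier(LightGBMClassifier)

After registration, the new component takes part in the search like any built-in classifier; it can also be forced via the estimator's include argument if one wants to test it in isolation.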