automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Question] Model score returns zero after 30 minutes #1560

Open timzewing opened 2 years ago

timzewing commented 2 years ago

I’m using 2.0.

My dataset shape is 307511, 122. Of the 122 features, 106 are numerical, and 16 are categorical.

I cannot get a metric score for the dataset I’m using. I have changed the time_left_for_this_task to 600, 800, 1000, 1200, 1400, and 1800 to give more time to find the appropriate models, but with no luck. I also changed the time for per_run_time_limit to 300.

Code snippet:

self.automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task=1800,
    per_run_time_limit=300,
    n_jobs=-1,
    scoring_functions=metrics.roc_auc,
)

I have two questions:

  1. What should I do for auto-sklearn to select a model with a metric score for my dataset?
  2. Is there a size limit (rows and columns) on how big a dataset auto-sklearn 2.0 can handle?

-Tim

eddiebergman commented 2 years ago

Hi @timzewing,

  1. You should use metric=metrics.roc_auc instead of scoring_functions if roc_auc is what you want to optimize for. scoring_functions are for additional metrics you want to compute but are not directly used for optimization. Additionally, scoring_functions takes a list[Scorer] and not just a single Scorer, i.e. scoring_functions=[metrics.roc_auc].
  2. There is no hard dataset size limit; autosklearn will scale down the dataset if some memory_limit is given (which you haven't set), otherwise it will attempt to use the full data. How this is done will be parametrizable by the user in the next version (ctrl+f "dataset_compression"). However, if your categorical columns have a huge number of categories, one-hot encoding could cause the processed dataset to explode in size, adding a large number of extra features.
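To make that second point concrete, here is a minimal, self-contained sketch (plain Python, not auto-sklearn's actual encoder) of how a single high-cardinality categorical column blows up under one-hot encoding:

```python
# Illustration only: one-hot encoding turns one categorical column into
# one binary feature per distinct category value.
def one_hot(column):
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in column]
    return rows, categories

# A column with 1000 distinct category values across 2000 rows:
column = [f"cat_{i % 1000}" for i in range(2000)]
encoded, categories = one_hot(column)
print(len(encoded[0]))  # the one original column has become 1000 features
```

With 16 categorical columns, a few high-cardinality ones can easily dominate the memory footprint of the processed dataset.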

You can use clf.sprint_statistics() to get a quick diagnosis. If you see that there were no successful trials, then the dataset size is likely becoming an issue with all the one-hot encoding, in which case all we can recommend is handling the categoricals as you think best.

If you did have successful trials and metrics.roc_auc is not what you're optimizing for, then try using scoring_functions=[metrics.roc_auc]; we should probably catch this and signal this likely error.

You should get a result after 600 seconds and you can remove the per_run_time_limit most likely. Feel free to share the output of sprint_statistics() so we can better help :)


Best, Eddie

timzewing commented 2 years ago

@eddiebergman

I followed your instructions but didn’t get a model score for ROC AUC.

Below are the two scenarios which didn’t work:

Scenario 1: I ran the data on all 305k rows and 122 columns. I removed the per_run_time_limit, and it ran for 4026 seconds with no score.

Scenario 2: I ran the data on all 305k rows and only the numerical columns (106 columns). I still didn’t get a score.

Screenshot of sprint_statistics():

Screen Shot 2022-08-11 at 14 54 08

Code snippet:

automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    per_run_time_limit=30,
    n_jobs=-1,
    metric=metrics.roc_auc,
)

Can you please suggest anything else I can try for this to work?

eddiebergman commented 2 years ago

Well, two more things. Have you tried just running the most basic thing you can, i.e. with no specific metric?

from autosklearn.classification import AutoSklearnClassifier

clf = AutoSklearnClassifier(time_left_for_this_task=300)
clf.fit(X, y)
print(clf.sprint_statistics())

That's a very simple way to debug if it's your dataset causing the issue or a bug with the metric.

Beyond that, you can get access to the logs and debugging information with this example.

Best, Eddie

timzewing commented 2 years ago

@eddiebergman

Scenario 3: As you suggested, I ran the dataset with only one parameter. I got back the results below. It selected a model with an accuracy score. It took 5 minutes to run.

Screen Shot 2022-08-16 at 15 02 02

Scenario 4: I ran the dataset with two parameters, including ROC AUC, and didn't get a score. I ran it for 300, 600, 900, and 1200 seconds.

Screen Shot 2022-08-16 at 15 13 03

Scenario 5: I added the logs and debugging to my code, but I don't know what to look for in the report. I'm attaching the files here.

Screen Shot 2022-08-16 at 15 48 29

AutoML(1)_40646c18-1d9a-11ed-9fc5-acde48001122.log distributed.log

Please let me know what I can try next. Your help is much appreciated.

eddiebergman commented 2 years ago

Hi @timzewing, both of those logs are empty?

So I can't diagnose exactly, because the issues in the first posts and the last one are now different, i.e. most of the runs are timing out rather than crashing, which makes me think that per_run_time_limit needs to be manually specified and set higher.

You can try to just set the total time limit to 600 seconds and then per run time limit to something like 300 seconds, so it should only try to train 2 algorithms.
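The arithmetic behind that suggestion, sketched in plain Python (the variable names mirror the constructor arguments, but this is illustration, not auto-sklearn API):

```python
# With a 600 s total budget and a 300 s per-run limit, at most
# floor(600 / 300) = 2 candidate models can be trained to completion.
time_left_for_this_task = 600   # total optimization budget, in seconds
per_run_time_limit = 300        # cap on a single model's fit, in seconds

max_full_runs = time_left_for_this_task // per_run_time_limit
print(max_full_runs)  # 2
```

Keeping the per-run limit at a sizeable fraction of the total budget narrows things down: if even two generously-budgeted runs fail, the problem is almost certainly the dataset or environment rather than the search.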

What you show makes me think this has nothing to do with the fact that a different metric is passed and is purely down to the dataset and timings somehow.

Could I also ask for the output of pip list | grep auto-sklearn?

Best, Eddie

timzewing commented 2 years ago

@eddiebergman

Can you please elaborate on what pip list | grep auto-sklearn is and how I can run it?

eddiebergman commented 2 years ago

@timzewing Sorry, I assumed you had some command line experience. pip is used to manage Python dependencies, and pip list prints out all the installed packages. grep is a command line utility to filter text, so grep auto-sklearn means to filter out all lines except ones that contain auto-sklearn. For example, when I run it using the development branch of auto-sklearn I get

auto-sklearn                  0.15.0 

Depending on how you installed auto-sklearn you will probably get something like

auto-sklearn                  0.14.7 

You can run !pip list in a Jupyter notebook cell; it will tell you the versions of all the packages you have installed, and you can let me know which version you are using.
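If the command line is unfamiliar, a pure-Python alternative can also be run in a notebook cell (assuming Python 3.8+, where importlib.metadata is in the standard library):

```python
# Pure-Python equivalent of `pip list | grep auto-sklearn`:
from importlib.metadata import PackageNotFoundError, version

try:
    msg = "auto-sklearn " + version("auto-sklearn")
except PackageNotFoundError:
    msg = "auto-sklearn is not installed"

print(msg)
```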

eddiebergman commented 2 years ago

Okay great, I would still need a non-empty log file to be able to provide any useful info. So far my only info to go on is:

I can tell you that autosklearn 0.14.7 is definitely working in our own experiments so the only factor that differs that I can't see is your data and your running environment.

So my suggestions are:

Best, Eddie

timzewing commented 2 years ago

Okay, let me run this and provide you the log.