timzewing opened this issue 2 years ago
Hi @timzewing,
Use `metric=metrics.roc_auc` instead of `scoring_functions` if `roc_auc` is what you want to optimize for. `scoring_functions` are for additional metrics you want to compute but that are not directly used for optimization. Additionally, `scoring_functions` takes a `list[Scorer]` and not just a single `Scorer`, i.e. `scoring_functions=[metrics.roc_auc]`.
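For instance, a minimal sketch of the corrected call (the constructor is the one used later in this thread; the extra scorers in `scoring_functions` are just examples):

```python
import autosklearn.experimental.askl2
from autosklearn import metrics

# metric: the single objective the optimizer tunes for.
# scoring_functions: a list of extra metrics computed for reporting only.
automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    metric=metrics.roc_auc,
    scoring_functions=[metrics.accuracy, metrics.log_loss],
)
```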
The data will only be subsampled if there is a `memory_limit` given (which you don't); otherwise it will attempt to use the full data. How this is done will be parametrizable by the user in the next version (ctrl+f "dataset_compression"). However, if your categorical columns have an insane number of categories, this could cause the processed dataset to explode in size due to one-hot encoding, leading to a large number of extra features being added.

You can use `clf.sprint_statistics()` to get a quick diagnosis. If you see that there were no successful trials, then likely the dataset size is becoming an issue with all the one-hot encoding, in which case all we can recommend is handling the categoricals as you think best.
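Not from the thread, but a quick way to spot the high-cardinality categoricals described above, assuming `X` is a pandas `DataFrame`:

```python
# Count distinct values per categorical column; a very large count means
# one-hot encoding will add that many extra feature columns.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns
print(X[categorical_cols].nunique().sort_values(ascending=False))
```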
If you did have successful trials and `metrics.roc_auc` is not what you're optimizing for, then try using `scoring_functions=[metrics.roc_auc]`; we should probably catch this and signal this likely error.
You should get a result after 600 seconds, and most likely you can remove the `per_run_time_limit`. Feel free to share the output of `sprint_statistics()` so we can better help :)
Note to auto-sklearn dev: validate `scoring_functions` when not passed a list and act accordingly, either updating the signature if it works, otherwise raise an explicit error early.
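A sketch of what that early check might look like (a hypothetical helper, not auto-sklearn's actual code):

```python
from autosklearn.metrics import Scorer

def _validate_scoring_functions(scoring_functions):
    # Hypothetical guard: fail fast if a single Scorer is passed where a
    # list[Scorer] is expected, instead of failing later and less clearly.
    if isinstance(scoring_functions, Scorer):
        raise TypeError(
            "scoring_functions takes a list[Scorer], not a single Scorer; "
            "did you mean scoring_functions=[your_scorer]?"
        )
    return scoring_functions
```

Best, Eddie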
@eddiebergman
I followed your instructions but didn’t get a model score for ROC AUC.
Below are the two scenarios which didn’t work:
Scenario 1: I ran the data on all 305k rows and 122 columns. I removed the per_run_time_limit, and it ran for 4026 seconds with no score.
Scenario 2: I ran the data on all 305k rows and only numerical columns (106 columns). I still didn't get a score.
Screenshot of sprint_statistics():
Code snippet: `automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(per_run_time_limit=30, n_jobs=-1, metric=metrics.roc_auc)`
Can you please suggest anything else I can try for this to work?
Well, two more things: have you tried just running the most basic thing you can, i.e. with no specific metric?
```python
from autosklearn.classification import AutoSklearnClassifier

clf = AutoSklearnClassifier(time_left_for_this_task=300)
clf.fit(X, y)
print(clf.sprint_statistics())
```
That's a very simple way to check whether it's your dataset causing the issue or a bug with the metric.
Beyond that, you can get access to the logs and debugging information with this example.
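To make sure the logs survive the run and can be shared, a sketch using auto-sklearn's `tmp_folder` and `delete_tmp_folder_after_terminate` options (the path is just an example):

```python
from autosklearn.classification import AutoSklearnClassifier

clf = AutoSklearnClassifier(
    time_left_for_this_task=300,
    tmp_folder="/tmp/autosklearn_debug",      # example location for the run's files
    delete_tmp_folder_after_terminate=False,  # keep the folder after fit() returns
)
clf.fit(X, y)
# The log files under /tmp/autosklearn_debug can then be attached here.
```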
Best, Eddie
@eddiebergman
Scenario 3: As you suggested, I ran the dataset with only one parameter. I got back the below results. It selected a model with a metric score from accuracy. It took 5 minutes to run.
Scenario 4: I ran the dataset with two parameters, including ROC AUC, and didn't get a score. I ran it for 300, 600, 900, and 1200 seconds.
Scenario 5: I added the logs and debugging to my code, but I don't know what to look for in the report. I'm attaching the files here.
AutoML(1)_40646c18-1d9a-11ed-9fc5-acde48001122.log distributed.log
Please let me know what I can try next. Your help is much appreciated.
Hi @timzewing, both of those logs are empty?
So I can't diagnose exactly, because the issue between the first posts and the last one is now different, i.e. most runs are timing out now rather than crashing, which makes me think that `per_run_time_limit` needs to be manually specified and set higher.
You can try to just set the total time limit to 600 seconds and the per-run time limit to something like 300 seconds, so it should only try to train 2 algorithms.
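In code, that suggestion looks something like this (a sketch reusing the `AutoSklearn2Classifier` setup from earlier in the thread):

```python
import autosklearn.experimental.askl2
from autosklearn import metrics

automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task=600,  # total budget of 600 seconds
    per_run_time_limit=300,       # up to 300 s per candidate, so roughly 2 fits
    metric=metrics.roc_auc,
)
```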
What you show makes me think this has nothing to do with the fact that a different metric is passed and is purely to do with the dataset and timings somehow.
Could I also ask for a `pip list | grep auto-sklearn`?
Best, Eddie
@eddiebergman
Can you please elaborate on what `pip list | grep auto-sklearn` is and how I can run it?
@timzewing Sorry, I assumed you had some command line experience. `pip` is used to manage Python dependencies and `pip list` prints out all the installed packages. `grep` is a command line utility to filter text, and `grep auto-sklearn` means to filter out all lines except ones that contain `auto-sklearn`. For example, when I run it using the development branch of auto-sklearn I get
auto-sklearn 0.15.0
Depending on how you installed auto-sklearn, you will probably get something like
auto-sklearn 0.14.7
You can run `!pip list` in a Jupyter notebook cell and it will tell you the version of all the packages you have installed; then you can let me know which version you are using.
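If the command line is unfamiliar, the version can also be checked from Python directly (a minimal sketch; `__version__` is the standard package attribute):

```python
import autosklearn

# Prints the installed auto-sklearn version, e.g. 0.14.7
print(autosklearn.__version__)
```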
Okay great, I would still need a non-empty log file to be able to provide any useful info. So far my only info to go on is:

- It's likely not `metric.roc_auc`, as this happens with and without it.
- I can tell you that autosklearn `0.14.7` is definitely working in our own experiments, so the only factor that differs that I can't see is your data and your running environment.
So my suggestions are:
Best, Eddie
Okay, let me run this and provide you the log.
I’m using 2.0.
My dataset shape is 307511, 122. Of the 122 features, 106 are numerical, and 16 are categorical.
I cannot get a metric score for the dataset I’m using. I have changed the time_left_for_this_task to 600, 800, 1000, 1200, 1400, and 1800 to give more time to find the appropriate models, but with no luck. I also changed the time for per_run_time_limit to 300.
Code snippet: `self.automl = autosklearn.experimental.askl2.AutoSklearn2Classifier(time_left_for_this_task=1800, per_run_time_limit=300, n_jobs=-1, scoring_functions=metrics.roc_auc)`
I have two questions:
-Tim