automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License
7.62k stars 1.28k forks

Document AutoSklearnClassifier constructor options #43

Closed Motorrat closed 8 years ago

Motorrat commented 8 years ago

Dear auto-sklearn team, I have just learned about this project and am very excited to try to include it into my modelling flow!

It seems the command-line option names for autosklearn are not the same as what the AutoSklearnClassifier() constructor accepts. So I have more or less reverse-engineered a few, but I still cannot figure out whether it is possible to specify task_type, for example; task_type="binary.classification" is rejected by the AutoSklearnClassifier() constructor.

I understand this is a very young project that is actively worked on. I will be happy to supply you with feedback from the field, as I am actively running modelling experiments on various datasets available at my company; currently I am successfully using scikit-learn with an SGDClassifier for one of them. Is there a better way to connect with you, a forum or a chat somewhere, to ask questions or give feedback?

mfeurer commented 8 years ago

Hi, currently GitHub issues are the only place to ask questions or give feedback. If something looks odd or you have a question, just open an issue.

I'm not sure if I get what you want to achieve. Can you please provide the following information:

Motorrat commented 8 years ago

Thanks Matthias, GitHub issues are good enough for me. I am running AutoSklearnClassifier() from within a Python scikit-learn program as a drop-in replacement for the SGDClassifier. I am interested in binary classification.

There are many options shown by autosklearn. Wouldn't the AutoSklearnClassifier() constructor accept them too?

[--temporary-output-directory TEMPORARY_OUTPUT_DIRECTORY] [--keep-output]
[--time-limit TIME_LIMIT] [--per-run-time-limit PER_RUN_TIME_LIMIT]
[--ml-memory-limit ML_MEMORY_LIMIT]
[--metalearning-configurations METALEARNING_CONFIGURATIONS]
[--ensemble-size ENSEMBLE_SIZE] [--ensemble-nbest ENSEMBLE_NBEST]
[--include-estimators [INCLUDE_ESTIMATORS [INCLUDE_ESTIMATORS ...]]]
[--include-preprocessors [INCLUDE_PREPROCESSORS [INCLUDE_PREPROCESSORS ...]]]
[-s SEED] [--exec-dir EXEC_DIR] [--metadata-directory METADATA_DIRECTORY]
--data-format {automl-competition-format,arff} --dataset DATASET
[--task {binary.classification,multiclass.classification,multilabel.classification,regression}]
[--metric {f1,r2_metric,acc_metric,a,acc,auc,bac_metric,r2,pac_metric,f1_metric,pac,bac,a_metric,auc_metric}]
[--target TARGET]
[--resampling-strategy {holdout,cv,partial-cv,nested-cv,holdout-iterative-fit}]
[--folds FOLDS] [--outer-folds OUTER_FOLDS] [--inner-folds INNER_FOLDS]

mfeurer commented 8 years ago

The AutoSklearnClassifier should figure out the task type by itself. You can have a look at the online documentation as well as the examples and replace the digits dataset with your own data.

The script you looked at can do more than the AutoSklearnClassifier; specifically, it also loads the dataset, while for the class you have to load the data yourself.

Motorrat commented 8 years ago

Thanks Matthias, I must have overlooked that page. I have already tested with the "digits" dataset and had a few runs with my own data, with promising results!

-- Diego


mfeurer commented 8 years ago

You're welcome. If you have any idea where we can put the documentation so that it's easier to find let us know. Moreover, please let us know if auto-sklearn improved over your current ML pipeline.

Anyway, can I close this issue for now?

Motorrat commented 8 years ago

Yes, this can be closed. Thanks for your support. Maybe just copying the constructor signature

class autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=360, initial_configurations_via_metalearning=25, ensemble_size=50, ensemble_nbest=50, seed=1, ml_memory_limit=3000, include_estimators=None, include_preprocessors=None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, output_folder=None, delete_tmp_folder_after_terminate=True, delete_output_folder_after_terminate=True, shared_mode=False)

just below the heading "Manual" would be more intuitive. I recall now that I checked the API page before, but somehow ignored it when I really needed this information. I like how it is done on the sklearn website: one can quickly get an overview of what parameters are supported and then read a short description. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

The results for auto-sklearn and my best SGDClassifier are actually close. I need to make sure I have exactly the same dataset and accuracy metric for both and will get back to you with the results.

mfeurer commented 8 years ago

Thanks for your feedback; we will consider it the next time we update the documentation.

Motorrat commented 8 years ago

Good results, overall! I am still learning the tool, so I might post an update.

This is my initial classifier:

clf = SGDClassifier(alpha=0.0001, n_iter=100, penalty='l2', loss="log", random_state=14, class_weight='balanced')

Test Precision: 0.34, Recall: 0.91, F1: 0.50

metric = 'auc_metric'
clf = AutoSklearnClassifier(time_left_for_this_task=300, per_run_time_limit=90, ml_memory_limit=10000)

Test Precision: 0.36, Recall: 0.91, F1: 0.51

metric = f1_metric
clf = AutoSklearnClassifier(time_left_for_this_task=300, per_run_time_limit=90, ml_memory_limit=10000)

Test Precision: 0.50, Recall: 0.71, F1: 0.58

metric = f1_metric
clf = AutoSklearnClassifier(ml_memory_limit=10000)

Test Precision: 0.46, Recall: 0.74, F1: 0.57

Interestingly enough, with your defaults the F1 decreased to 0.57, compared to 0.58 for the 5-minute run. Slight overfitting? I will have to experiment with cross-validation.

Just to give you an idea of how many rows there are in the dataset, here is one test confusion matrix: [[7935 473] [ 193 467]]. The train dataset is about ten times larger.

The scores are calculated using sklearn metrics package:

from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)
confusion_matrix(y_true, y_pred)
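As a sanity check, the precision, recall, and F1 reported for the f1_metric run above can be recomputed by hand from the confusion matrix [[7935 473] [193 467]], using only the numbers quoted in this thread:

```python
# sklearn's confusion_matrix layout for binary labels is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 7935, 473, 193, 467

precision = tp / (tp + fp)   # 467 / 940
recall = tp / (tp + fn)      # 467 / 660
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # → 0.5 0.71 0.58
```

These line up with the Precision: 0.50, Recall: 0.71, F1: 0.58 reported for the 5-minute f1_metric run, so that run and this confusion matrix appear to belong together.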

mfeurer commented 8 years ago

Thanks for reporting this. We have also observed some overfitting in the holdout setting with datasets smaller than 100,000 samples, but have not yet studied this problem systematically. Do you already have some results for the CV setting?

Motorrat commented 8 years ago

I am using the above accuracy metric and need the predictions to calculate it, but I get NotImplementedError: Predict is currently only implemented for resampling strategy holdout.

y_pred = clf.predict(X_test_t)

File "/home/centos/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/estimators.py", line 286, in predict
    return super(AutoSklearnClassifier, self).predict(X)
File "/home/centos/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/automl.py", line 591, in predict
    return np.argmax(self.predict_proba(X), axis=1)
File "/home/centos/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/estimators.py", line 300, in predict_proba
    return super(AutoSklearnClassifier, self).predict_proba(X)
File "/home/centos/anaconda2/lib/python2.7/site-packages/AutoSklearn-0.0.1.dev0-py2.7-linux-x86_64.egg/autosklearn/automl.py", line 601, in predict_proba
    'Predict is currently only implemented for resampling '
NotImplementedError: Predict is currently only implemented for resampling strategy holdout.

mfeurer commented 8 years ago

I'm sorry for that; the exception text is not yet updated. You need to call the refit() method in order to use predict() with cross-validation. While auto-sklearn finds a configuration with fit(), it does not train a model on the full dataset which can be used for predictions later on. In the holdout case auto-sklearn simply uses the one trained model, but in the cv case it does not; instead, it requires the user to retrain the configurations on the full dataset. I will add a more useful error message telling the user what went wrong.

Motorrat commented 8 years ago

Thanks, will try that.

In my use case of binary classification I optimize for AUC, but need to guarantee some minimum accuracy, so that call-center agents do not get too many bad leads and get frustrated. Currently I use the predict_proba method and adjust the threshold to the value that corresponds to that minimum accuracy on the test dataset. Maybe this falls into the topic of "refit", maybe not. I just wanted to ask whether this seems like something you'd want to add to your toolset. This may be a very common use case in the real world, where the data is not ideal and accuracy is always a concern.
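For reference, the thresholding step described here (reading the "minimum accuracy" floor as a precision floor on the positive class, which the bad-leads framing suggests) can be sketched as a small standalone helper. The function name and toy data below are illustrative, not part of auto-sklearn or sklearn:

```python
def best_threshold(y_true, y_score, min_precision):
    """Sweep candidate thresholds over the scores and return the
    (threshold, precision, recall) triple with the highest recall
    whose precision is at or above min_precision, or None if no
    threshold satisfies the floor."""
    best = None
    for t in sorted(set(y_score)):
        pred = [1 if s >= t else 0 for s in y_score]
        tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best

# Toy scores standing in for a classifier's predict_proba output.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
# Threshold 0.3 keeps precision >= 0.6 while catching every positive.
print(best_threshold(y_true, y_score, min_precision=0.6))
```

In practice the sweep would run on a held-out validation set, and the chosen threshold would then be applied to new predictions.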


Motorrat commented 8 years ago

Now with the refit() method:

clf = AutoSklearnClassifier(ml_memory_limit=10000, resampling_strategy='cv', resampling_strategy_arguments={'folds': 5})

metric = auc_metric
5 minutes: Precision: 0.36, Recall: 0.91, F1: 0.51
1 hour: Precision: 0.36, Recall: 0.90, F1: 0.52

metric = f1_metric
1 hour: Precision: 0.50, Recall: 0.69, F1: 0.58

mfeurer commented 8 years ago

Regarding your target metric: do I understand correctly that you want to optimize a metric which is basically accuracy, but with a threshold set so that a minimum accuracy is achieved? I think coding new metrics is beyond the scope of our project. Nevertheless, we plan to allow user-defined metrics, so you could code this yourself and tell auto-sklearn to optimize it. Would this help you?

Motorrat commented 8 years ago

No, I want the largest possible recall at a certain minimal accuracy or better. Does that make sense?


mfeurer commented 8 years ago

I think I've got it; you want to maximize recall, but also optimize accuracy as a second objective. This is a very hard problem in our setting, since we would have to adapt the meta-learning step, the global optimization step, and the ensemble-building step. We do not plan to do this in the future, but we will at least allow custom target metrics. Such a custom target metric could take a trade-off between recall and accuracy into account.

I'm currently asking myself whether 1 hour is sufficient for the size of your dataset. Since you have ~9000 test points, I assume you have something like ~20000 training points. Could you try to run auto-sklearn overnight?

Motorrat commented 8 years ago

I have run it overnight:

metric = f1_metric
time_left_for_this_task=36000, per_run_time_limit=900

10 hours: Precision: 0.49, Recall: 0.70, F1: 0.58

BTW, for the first time I got these warnings:

warnings.warn("Mean of empty slice", RuntimeWarning)
/home/centos/anaconda2/lib/python2.7/site-packages/sklearn/lda.py:371: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")

Is there a way to find out which variables are considered to be collinear by auto-sklearn? Or do I have to run a usual collinearity analysis outside the toolkit?

mfeurer commented 8 years ago

Yes, you have to use a toolkit outside of auto-sklearn. We only provide the classification pipeline.
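For what it's worth, a quick way to spot collinear columns outside auto-sklearn is to scan the feature correlation matrix. A minimal numpy sketch (the function name and toy data are illustrative, and this catches pairwise correlation only, not multi-variable collinearity):

```python
import numpy as np

def collinear_pairs(X, threshold=0.95):
    """Return index pairs of columns of X whose absolute Pearson
    correlation is at or above the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are the variables
    n = corr.shape[0]
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if abs(corr[i, j]) >= threshold]

# Toy data: column 2 is an exact multiple of column 0.
X = np.array([[1.0, 5.0, 2.0],
              [2.0, 3.0, 4.0],
              [3.0, 8.0, 6.0],
              [4.0, 1.0, 8.0]])
print(collinear_pairs(X))  # → [(0, 2)]
```

A more thorough analysis would use variance inflation factors, but a pairwise scan like this is usually enough to explain the "Variables are collinear" warning from LDA.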