automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Can Autosklearn handle Multi-Class/Multi-Label Classification and which classifiers will it use? #1429

Open asmgx opened 2 years ago

asmgx commented 2 years ago

I have been trying to use AutoSklearn with multi-class classification.

My labels look like this:

0 1 2 3 4 ... 200
1 0 1 1 1 ... 1
0 1 0 0 1 ... 0
1 0 0 1 0 ... 0
1 1 0 1 0 ... 1
0 1 1 0 1 ... 0
1 1 1 0 0 ... 1
1 0 1 0 1 ... 0

I used this code

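import autosklearn.classification
import autosklearn.metrics
from sklearn.model_selection import train_test_split

# Keep only the selected label columns
y = y[:, (65, 67, 54, 133, 122, 63, 102, 105, 39)]
# Drop the column(s) referenced by Code to obtain the features
X = df.drop(Code, axis=1, errors='ignore')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = autosklearn.classification.AutoSklearnClassifier(
    include={'feature_preprocessor': ["no_preprocessing"]},
    exclude={'classifier': ['random_forest']},
    time_left_for_this_task=60 * 5,   # total budget in seconds
    per_run_time_limit=60 * 1,        # budget per candidate model in seconds
    memory_limit=1024 * 10,           # in MB
    n_jobs=-1,
    metric=autosklearn.metrics.f1_macro,
)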

But now I want to train AutoSklearn on multi-class multi-label classification.

Which of these methods should I use?

1-

clf = OneVsRestClassifier(automl, n_jobs=-1)
clf.fit(X_train, y_train)

2-


clf = automl
clf.fit(X_train, y_train)

3-

I have to loop one class at a time and use

clf = automl
clf.fit(X_train, y_train)

so it will be like

for i in (65, 67, 54, 133, 122, 63, 102, 105, 39):
    y = z[:, i]  # z is presumably the full label matrix; one column per iteration
    X = df.drop(Code, axis=1, errors='ignore')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    automl = autosklearn.classification.AutoSklearnClassifier(
        include={'feature_preprocessor': ["no_preprocessing"]},
        exclude={'classifier': ['random_forest']},
        time_left_for_this_task=60 * 5,
        per_run_time_limit=60 * 1,
        memory_limit=1024 * 10,
        n_jobs=1,
        metric=autosklearn.metrics.f1_macro,
    )

    clf = automl
    clf.fit(X_train, y_train)

so I get a different model for each label?

eddiebergman commented 2 years ago

Hey again @asmgx,

Just as a note, the first example you give is multi-label, as there are multiple label columns and not just one.

Method 2 will not work, as we do not natively support multi-class multi-label classification. This is because sklearn models usually don't support it natively and require adapters, similar to the one you show in option 1. However, option 1 will also not work; read its description carefully: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn-multiclass-onevsrestclassifier. It supports one or the other, but not both simultaneously.

In general, I don't think support for multi-class multi-label is very widespread, and I would advise reframing the problem as you suggest in 3. One option, as you suggest, is to fit one classifier per multi-class target column and combine their results at the end. Another option is to one-hot encode each multi-class target column into multiple binary ones. In the same way that you can one-hot encode categorical feature columns, you can one-hot encode target columns that contain multiple classes, repeating this for each column in your output. This can increase the number of target columns dramatically, depending on the number of classes, and it also makes translating between your original targets and the one-hot encoded variant more difficult to implement.
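The one-hot reframing of the targets could look roughly like the sketch below. It is an illustration under assumptions, not auto-sklearn functionality: `one_hot_targets` and `decode_targets` are hypothetical helper names, and decoding by taking the argmax within each block of binary columns is only one way to resolve predictions that are not cleanly one-hot.

```python
import numpy as np


def one_hot_targets(Y):
    """Expand each multiclass column of Y into one binary column per class.

    Returns the binary matrix and the list of class values per original column.
    """
    Y = np.asarray(Y)
    classes = [np.unique(Y[:, j]) for j in range(Y.shape[1])]
    blocks = [
        (Y[:, [j]] == cls[None, :]).astype(int)  # shape (n_samples, n_classes_j)
        for j, cls in enumerate(classes)
    ]
    return np.hstack(blocks), classes


def decode_targets(scores, classes):
    """Map per-column scores (or 0/1 predictions) back to the original labels."""
    scores = np.asarray(scores)
    decoded, start = [], 0
    for cls in classes:
        block = scores[:, start:start + len(cls)]
        decoded.append(cls[block.argmax(axis=1)])  # pick the strongest class per block
        start += len(cls)
    return np.column_stack(decoded)


# Two target columns with three classes each become six binary columns.
Y = np.array([[1, 2], [0, 2], [0, 0], [2, 1]])
Y_bin, classes = one_hot_targets(Y)
Y_back = decode_targets(Y_bin, classes)
```

`Y_bin` is a plain binary indicator matrix, so it has the multilabel shape discussed in this thread; `classes` has to be kept around to translate predictions back.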

But to reiterate, we don't support it natively and implementation is left to the user.

Best, Eddie
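For the other reframing, one model per multi-class target column, a minimal sketch might look like this. The helper names `fit_per_column` and `predict_per_column` are hypothetical, and `make_estimator` is assumed to be any zero-argument callable returning an object with scikit-learn style `fit` and `predict` methods (for example a lambda that builds an `AutoSklearnClassifier`).

```python
import numpy as np


def fit_per_column(make_estimator, X, Y):
    """Fit one independent model per target column and return them in order."""
    Y = np.asarray(Y)
    models = []
    for j in range(Y.shape[1]):
        model = make_estimator()  # fresh, unfitted model for this column
        model.fit(X, Y[:, j])
        models.append(model)
    return models


def predict_per_column(models, X):
    """Stack each model's predictions into an (n_samples, n_columns) array."""
    return np.column_stack([model.predict(X) for model in models])
```

Each column gets its own search budget this way, so the total run time grows with the number of target columns. scikit-learn's `MultiOutputClassifier` wraps the same idea, though whether it composes cleanly with `AutoSklearnClassifier` has not been checked here.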

vgargan2 commented 2 years ago

Hello to all,

For my undergraduate thesis, I am trying to benchmark some AutoML tools. Specifically, I am trying to plot ROC curves and calculate the area under the ROC curve for multiclass (not multilabel) classification on some datasets from OpenML-CC18 using AutoSklearn. Basically, I am trying to implement this using AutoSklearnClassifier.

As eddiebergman already correctly pointed out, the `clf = OneVsRestClassifier(automl, n_jobs=-1)` followed by `clf.fit(X_train, y_train)` bit can't be used directly.

Can you please provide an example of how this can be done?

Thanks in advance!

eddiebergman commented 2 years ago

Hi @vgargan2,

We support regular multi-class classification out of the box. I realize we don't have an example to show this, but we regularly test on the benchmark openml/s/218, which is similar in spirit to OpenML-CC18.
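As a rough illustration of the multiclass ROC question, the sketch below computes one-vs-rest ROC curves and the macro-averaged AUC from predicted class probabilities. It makes several assumptions: the data is synthetic, and `LogisticRegression` is only a stand-in so the example runs quickly; any fitted classifier exposing `predict_proba`, such as `AutoSklearnClassifier`, would be used in the same place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class problem standing in for an OpenML dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Stand-in model; a fitted AutoSklearnClassifier would slot in here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Macro-averaged one-vs-rest AUC over all classes.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr")

# One ROC curve per class: that class against all the others.
y_bin = label_binarize(y_test, classes=clf.classes_)
curves = {}
for k, cls in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    curves[int(cls)] = (fpr, tpr)
```

Each `(fpr, tpr)` pair in `curves` can be passed to a plotting library to draw that class's curve.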

In case this thread begins to confuse other readers, I'm going to spell out the four distinctions and clarify which ones we support.

Best, Eddie
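One way to check which of the four target types a given `y` falls under is scikit-learn's `type_of_target` utility. The sketch below uses small made-up arrays; the dictionary keys are just the informal names used in this thread, while the detected values are scikit-learn's own terms.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Informal name from this thread -> small example target
targets = {
    "binary": np.array([1, 0, 1, 1]),
    "multiclass": np.array([1, 2, 0, 2]),
    "multilabel": np.array([[1, 0], [0, 0], [1, 1], [1, 0]]),
    "multiclass-multilabel": np.array([[1, 2], [0, 2], [0, 0], [2, 1]]),
}

detected = {name: type_of_target(y) for name, y in targets.items()}
for name, kind in detected.items():
    print(f"{name:>22}: {kind}")
```

scikit-learn reports the last two as `multilabel-indicator` and `multiclass-multioutput` respectively, which are the terms to search for in its documentation.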

asmgx commented 2 years ago

@eddiebergman this is confusing. You are saying that multilabel classification is supported, which is the same example I mentioned at the beginning of this post.

Do you mean that a dataset with target values like the following is supported?

RowNo  Feature1  Feature2  Feature3  |  Label1  Label2  Label3  Label4  Label5
-------------------------------------------------------------------------------
1      73        84        34        |  0       1       1       0       1
2      37        88        84        |  0       0       0       1       1
3      93        90        58        |  1       0       1       1       0
4      77        44        66        |  1       1       1       0       0
5      48        82        38        |  1       1       0       1       1
6      53        87        42        |  0       1       0       0       0
7      80        55        28        |  1       0       0       1       0
8      66        74        97        |  0       0       1       1       1
8               66             74            97         |       0           0             1           1           1

Can you advise how we can work with this example?

eddiebergman commented 2 years ago

@asmgx, I apologise, I misread your example in the very first section. Yes, that example is multilabel and it is supported. Nothing needs to be done to support it; autosklearn will work out of the box with those labels: `automl = AutoSklearnClassifier(); automl.fit(X, y)`.

I read the column headers as being non-binary and assumed you meant multiclass-multilabel classification, especially given the title of the issue.

This whole issue shows that we should have a clear section about this. I also sometimes mix up multiclass and multilabel, and I don't expect everyone to know that you can combine the two to get the entirely different multiclass-multilabel, for which sklearn has limited support.

For those scrolling to the bottom of the issue:

import numpy as np
from autosklearn.classification import AutoSklearnClassifier

# Nothing has to be done for multi-label OR multi-class
X = np.random.rand(4, 2)  # 4 examples, 2 features

# For binary
binary_y = [1, 0, 1, 1]
automl = AutoSklearnClassifier()
automl.fit(X, binary_y)

# For multiclass
multiclass_y = [1, 2, 0, 2]
automl = AutoSklearnClassifier()
automl.fit(X, multiclass_y)

# For multilabel
multilabel_y = [[1, 0], [0, 0], [1, 1], [1, 0]]
automl = AutoSklearnClassifier()
automl.fit(X, multilabel_y)

# For multiclass-multilabel y
# NOT SUPPORTED
multiclass_multilabel_y = [[1, 2], [0, 2], [0, 0], [2, 1]]

asmgx commented 2 years ago

@eddiebergman Thanks. Is there more documentation on how AutoSklearn supports multi-label datasets and how it builds its models? I know that not all algorithms support multi-label natively, so does it use OneVsRestClassifier internally, or does it loop over all the labels?

Is there any documentation that covers this?

eddiebergman commented 2 years ago

Nothing special is done. When doing multi-label classification, we only consider models that natively support multilabel classification.

https://github.com/automl/auto-sklearn/blob/6cc8bb179fcb023d1c341cf33d2958a16a6935be/autosklearn/pipeline/components/classification/__init__.py#L68

There's no documentation on this, but there probably should be something that describes all of it.
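The filtering described above can be pictured with the sketch below. Everything in it is an assumption for illustration: the component names and capability flags are placeholders rather than auto-sklearn's actual registry, and `available_components` is a hypothetical function that only mimics the idea of skipping models that cannot handle the detected target type.

```python
# Placeholder capability table; names and flags are invented for illustration
# and do not reflect which auto-sklearn classifiers support which targets.
COMPONENTS = {
    "classifier_a": {"handles_multiclass": True, "handles_multilabel": True},
    "classifier_b": {"handles_multiclass": True, "handles_multilabel": False},
    "classifier_c": {"handles_multiclass": False, "handles_multilabel": False},
}


def available_components(components, dataset_properties):
    """Keep only components whose flags cover the dataset's target type."""
    selected = {}
    for name, props in components.items():
        if dataset_properties.get("multilabel") and not props["handles_multilabel"]:
            continue  # no native multilabel support, so it is never tried
        if dataset_properties.get("multiclass") and not props["handles_multiclass"]:
            continue
        selected[name] = props
    return selected


multilabel_choices = available_components(COMPONENTS, {"multilabel": True})
multiclass_choices = available_components(COMPONENTS, {"multiclass": True})
```

Under this picture no adapter such as `OneVsRestClassifier` is involved: the search space simply shrinks to the models that accept the target as given.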

mfeurer commented 2 years ago

We document the supported tasks here, but we should potentially rename this to "support target types" and link to scikit-learn's glossary. For example, for multi-label we should link to https://scikit-learn.org/stable/glossary.html#term-multilabel. Indeed, we have no documentation on which classifier is used for which target type, and it would be great to have that.