automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Question] Is any autosklearn.metrics.roc_auc WEIGHTED that is possible? #1531

Closed: simonprovost closed this issue 2 years ago

simonprovost commented 2 years ago

Dear Sir or Madam,

Is a weighted autosklearn.metrics.roc_auc available, or otherwise possible to use?


eddiebergman commented 2 years ago

Hi @simonprovost,

You can implement your own metric if you need to! You do miss out on some meta-learning that way, but given enough time this shouldn't be too problematic.

https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_metrics.html#sphx-glr-examples-40-advanced-example-metrics-py

Hopefully that lets you implement what you need?
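For anyone skimming, the linked example boils down to roughly the following (a minimal sketch, not copied verbatim from the docs; the metric and the time budget are just illustrative):

import numpy as np
import autosklearn.classification
import autosklearn.metrics

def error_rate(y_true, y_pred):
    # Any function mapping (y_true, y_pred) to a single float can serve as a metric
    return np.mean(y_true != y_pred)

error_scorer = autosklearn.metrics.make_scorer(
    name="error_rate",
    score_func=error_rate,
    optimum=0,
    greater_is_better=False,  # lower error is better
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    metric=error_scorer,  # the search now optimises the custom metric
)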

Best, Eddie

simonprovost commented 2 years ago

Hi @eddiebergman

Wonderful. Could you tell me whether the metric function for AUROC is made available anywhere? I would like to design a similar one that returns only the result for the positive class (in my case, class label 0 in a binary classification), so that the auto-sklearn pipeline would then optimise the AUROC of that positive class, class label 0, for my use case. Thanks again for your support!

Cheers

eddiebergman commented 2 years ago

Heyo,

So I think this might be what you're referring to? https://github.com/automl/auto-sklearn/blob/4f691a1db90f47a9118c193b52bccb797b604a64/autosklearn/metrics/__init__.py#L362-L368

Essentially we just wrap the sklearn one and its implementation; we don't want to go down the rabbit hole of defining our own metrics.
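Paraphrasing those lines, the built-in scorer is essentially just a thin wrapper (not verbatim):

import sklearn.metrics
from autosklearn.metrics import make_scorer

# Roughly how the built-in roc_auc is defined: wrap sklearn.metrics.roc_auc_score
# and request continuous scores via needs_threshold
roc_auc = make_scorer(
    "roc_auc",
    sklearn.metrics.roc_auc_score,
    greater_is_better=True,
    needs_threshold=True,
)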

Best, Eddie

simonprovost commented 2 years ago

Hiya @eddiebergman,

Regarding the roc_auc make_scorer you showed: if I am not mistaken, it computes the AUROC by averaging over both classes. What my hospital research lab would like instead is to use the AUROC of the positive class, which for us is class label 0 (binary classification), as the statistic. Normally, after calling predict_proba we obtain the predicted probabilities for class label 0 and class label 1, take class label 0, i.e. [:, 0], since it is our class of interest, and pass it to scikit-learn's roc_auc_score function. That is how we would like the pipeline to understand our goal: ideally, a high score on that specific metric, if that makes sense. However, after a few moments of reflection following your answer, I am thinking that we cannot achieve this by changing the metric alone; rather, I should probably adjust the evaluation function. Would you not say?
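Concretely, outside auto-sklearn we currently do something along these lines (a sketch with made-up numbers, just to show what I mean by [:, 0]):

import numpy as np
from sklearn.metrics import roc_auc_score

y_test = np.array([0, 1, 1, 0, 1])    # true labels, 0 is our positive class
proba = np.array([[0.9, 0.1],         # made-up predict_proba output
                  [0.2, 0.8],
                  [0.4, 0.6],
                  [0.7, 0.3],
                  [0.1, 0.9]])

proba_class0 = proba[:, 0]                          # probability of class 0
score = roc_auc_score(y_test == 0, proba_class0)    # class 0 treated as positive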

A little context on why we want to pursue this: because we are dealing with a severely imbalanced event, we do not want to concentrate on the AUROC of the negative class, but rather on that of the positive class, which has a severely skewed ratio in the dataset. This positive class is label 0, which I understand is not conventional, but that is how it sometimes is in the medical area ^^.

Thanks again for your assistance, Eddie; you have made me realise that I may not be able to do this through the metric we set on the pipeline. I will be awaiting your thoughts, though, to be sure.

Cheers,

eddiebergman commented 2 years ago

So it seems like this is an unbalanced binary classification problem with the positive class being 0?

Assuming that:

from autosklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix

def _mymetric(y_true, y_pred):

   # Invert labels so 0 is positive class
   # only works for binary classification
   flipped_true = ~y_true.astype(bool)
   flipped_pred = ~y_pred.astype(bool)
   tn, fp, fn, tp = confusion_matrix(flipped_true, flipped_pred).ravel()

   # Return whichever one of them you care about
   return ...

make_scorer(
   name="mymetric",
   score_func=_mymetric,
   # ... see args and fill in as required
)

If I'm wrong on the assumptions, you could provide a minimal y_true, y_pred and the expected output that captures what you need, and I can maybe point you in the right direction.

Also, to refer back to the original question (and for people scrolling by), to which I didn't give a clear answer:

sklearn metrics accept a sample_weight argument, and we expose that as a parameter in autosklearn.metrics.Scorer, but we do not have any way to pass it through the pipeline, i.e. auto-sklearn cannot optimize for these sample weights.

To create your own weights based on classes, you should use a custom metric and define the weight per label yourself. An example of creating your own metric is shown in the example linked above.
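For instance, a per-class weighting could be wired up roughly like this (a sketch rather than an official snippet; the weight values are placeholders, and it assumes binary labels with y_pred carrying the positive-class score, as for the built-in roc_auc):

import numpy as np
from sklearn.metrics import roc_auc_score
from autosklearn.metrics import make_scorer

# Hypothetical per-class weights; pick values that reflect your application
CLASS_WEIGHTS = {0: 5.0, 1: 1.0}

def _weighted_roc_auc(y_true, y_pred):
    # Build a per-sample weight vector from each sample's true class
    sample_weight = np.array([CLASS_WEIGHTS[int(label)] for label in y_true])
    return roc_auc_score(y_true, y_pred, sample_weight=sample_weight)

weighted_roc_auc = make_scorer(
    name="weighted_roc_auc",
    score_func=_weighted_roc_auc,
    optimum=1,
    greater_is_better=True,
    needs_threshold=True,  # request continuous scores, as the built-in roc_auc does
)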


Best, Eddie

simonprovost commented 2 years ago

Hi @eddiebergman, That is a priceless response! Thank you so much, that might benefit more than one person in the community ☀️

If you do not mind, I will return to some of your points and ask some questions that immediately came to mind after your answer:

> The simplest solution would be to flip the labels in that case, before reaching autosklearn and then use whatever metric you need. That way it conforms to the usual notion in binary classification where 1 is the positive label.

I doubt that this is a major concern from a medical aspect, but I would prefer to maintain the status quo if you can confirm that it will not result in additional pipeline optimisation issues.

> If that's not possible for whatever reason, then I would imagine you have seen the various metrics surrounding a ConfusionMatrix which can often capture the kind of information you care about when considering highly imbalanced datasets. [...]

The concept of flipping y_true and y_pred was clever, thanks! Simple and reliable! Following this line of thought, I would like to pose the following questions:

  1. My medical application's metric of interest would actually (after a few discussions with my research colleagues) be the F-measure of the positive class (the minority class in our use case, denoted by class label 0), since it is the harmonic mean of recall and precision. Does the implementation below make sense to you, please?

from sklearn.metrics import f1_score

def _mymetric(y_true, y_pred):
    labels = [0, 1]

    # Invert labels so the original class 0 becomes the positive class (1)
    flipped_true = ~y_true.astype(bool)
    flipped_pred = ~y_pred.astype(bool)

    f1_scores = f1_score(flipped_true, flipped_pred, average=None, labels=labels)
    f1_scores_with_labels = dict(zip(labels, f1_scores))

    return f1_scores_with_labels[1]

After flipping y_true and y_pred, I call the f1_score function and take label 1, which corresponds to my previous label 0 (the positive class), correct?

  2. According to https://arxiv.org/abs/2010.16253, the AUROC may not be a very fair comparison metric for imbalanced classification in medical applications, and it has been shown that the AUPRC (Area Under the Precision-Recall Curve) handles the issue better. So, in order to apply this measure, I did the following; does this make sense, please?:
from sklearn.metrics import auc, precision_recall_curve

def _mymetric(y_true, y_pred):
    # Invert true labels so the original class 0 becomes the positive class (1)
    flipped_true = ~y_true.astype(bool)

    # y_pred are probabilities (assumed to be for class 1), so the score for the
    # original class 0 is the complement
    flipped_pred = 1 - y_pred

    precision, recall, thresholds = precision_recall_curve(flipped_true, flipped_pred)
    return auc(recall, precision)

After flipping y_true and y_pred, I use the precision_recall_curve function so that the curve is computed for label 1, which was previously label 0 (the positive class), correct?
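For completeness, I imagine I would wrap this with make_scorer roughly as follows (a sketch; I am assuming needs_threshold=True is the flag that delivers continuous scores, mirroring the built-in roc_auc):

from autosklearn.metrics import make_scorer

# _mymetric consumes continuous scores rather than hard labels, so the scorer
# must request them (assumption: needs_threshold=True, as for roc_auc)
auprc_positive_class = make_scorer(
    name="auprc_positive_class",
    score_func=_mymetric,
    optimum=1,
    greater_is_better=True,
    needs_threshold=True,
)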

I will stop there, as I believe I have asked everything I wanted to about the custom metrics. I am confident this will help others in the community working with these particular measures regardless.

Thank you for your understanding Eddie, Best wishes

eddiebergman commented 2 years ago

Ello again,

So I don't want to confirm or deny whether your metrics make sense; metrics are one of the things AutoML can't control because they are very much a user-defined objective. For clarification on how the metric affects optimization: auto-sklearn simply treats it as a black-box objective and searches for pipelines that maximise the single number it returns.

In light of that, I think the best way forward is for you to come up with sample predictions and sample outputs and check whether the metric you made makes sense and is correct for your application. You can always wrap your _mymetric in autosklearn.metrics.make_scorer, try it out, and see if it does what you intended!
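For example, even something this small tells you a lot (a sketch with hand-made labels, written for the hard-label F1 variant; use probabilities instead for the AUPRC one):

import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 1])      # 0 is the rare positive class
good_pred = np.array([0, 0, 1, 1, 1, 1])   # perfect on class 0
bad_pred = np.array([1, 1, 1, 1, 1, 1])    # never predicts class 0

# The metric should clearly reward the first and punish the second
print(_mymetric(y_true, good_pred))
print(_mymetric(y_true, bad_pred))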

Best, Eddie

simonprovost commented 2 years ago

@eddiebergman Hello Again,

How the pipeline uses this custom metric is crystal clear now; thank you for the details. However, something quite intriguing: would it make sense for a custom metric that returns the value of a single class, such as the F-measure of the positive class, to produce a large difference between the positive and negative class results in, for example, the classification report metrics (precision/recall/F1) of the best model?

I attempted to develop a custom metric for the F-measure of the positive class, and the pipeline discovered an algorithm (produced a model) that yielded excellent results for the positive class but nearly zero results for the negative class (effectively overfitting to the positive class). Would a discrepancy of this magnitude make sense to you?

e.g., in comparison with an F-measure over both classes, such as F1-macro, the distinction is evident: F1-macro will produce a model whose per-class F1 results are comparable, whereas optimising the F-measure of the positive class alone reveals a significant gap between the classes. I am currently doing a lengthy run on F1-macro so that I can build a baseline against the F-measure of the positive class and report precise outputs. Nonetheless, even without completing the run, I am confident that the F1-macro model will look better overall, which is quite bothersome given that I want the best model to be selected based on the F-measure of the positive class (imbalanced binary classification).

Cheers,

eddiebergman commented 2 years ago

Hey @simonprovost,

Unfortunately I can't give good advice here, as metrics are something AutoML isn't really aimed at solving; each problem will have its own issues with metrics. I will point out the scoring_functions argument to the estimators, i.e. AutoSklearnClassifier, which can help you get some other information beyond the optimization metric.
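Roughly along these lines (a sketch; substitute your own scorer for the metric argument):

import autosklearn.classification
from autosklearn.metrics import balanced_accuracy, f1, f1_macro

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    metric=f1,                                        # substitute your custom scorer here
    scoring_functions=[f1_macro, balanced_accuracy],  # recorded, not optimised
)
# After fitting, the extra metrics should show up in automl.cv_results_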

I would probably close the issue here if you don't mind, as I believe these questions fall more into the realm of evaluating a model for your own use case, and while interesting, I'm not really equipped to give good advice, especially if it's to be used for medical research purposes.

Best, Eddie