Hi @simonprovost,
You could implement your own metric if you need to! You do miss out on some meta-learning this way, but given enough time this shouldn't be too problematic.
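For reference, a rough sketch of what that can look like (the score function below is just a placeholder and, if I recall correctly, the custom scorer is passed via the metric argument of the estimator):

import sklearn.metrics
import autosklearn.classification
from autosklearn.metrics import make_scorer

# Placeholder custom metric: wrap any sklearn-style score function
def my_score(y_true, y_pred):
    return sklearn.metrics.balanced_accuracy_score(y_true, y_pred)

my_scorer = make_scorer(name="my_score", score_func=my_score)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    metric=my_scorer,  # optimize the custom metric instead of the default
)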
Hopefully that lets you implement what you need?
Best, Eddie
Hi @eddiebergman
Wonderful. Can you tell me if the metric function for AUROC is made available anywhere? I would like to design a similar one that just returns the result for the positive class (in my case, class label 0 in a binary classification), so that the AutoSklearn pipeline would optimise the AUROC of the positive class, class label 0, for my use case. Thanks once more for your support!
Cheers
Heyo,
So I think this might be what you're referring to? https://github.com/automl/auto-sklearn/blob/4f691a1db90f47a9118c193b52bccb797b604a64/autosklearn/metrics/__init__.py#L362-L368
Essentially we just wrap the sklearn implementation; we don't want to go down the rabbit hole of defining our own metrics.
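For context, the linked definition is roughly the following (paraphrased, so check the permalink above for the exact code); it really is just sklearn's roc_auc_score wrapped in our make_scorer:

import sklearn.metrics
from autosklearn.metrics import make_scorer

roc_auc = make_scorer(
    "roc_auc",
    sklearn.metrics.roc_auc_score,
    greater_is_better=True,
    needs_threshold=True,
)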
Best, Eddie
Hiya @eddiebergman,
Regarding the roc_auc make_scorer you showed, that one calculates the AUROC value by averaging over both classes, if I am not wrong. But what we would like in my hospital research lab is to use the AUROC of the positive class, which for us is class label 0 (binary classification), as the statistic. After calling predict_proba, we obtain the predicted probabilities for class label 0 and class label 1; we take the class label 0 column - [:, 0] - as it is the one of interest, and usually pass it to scikit-learn's roc_auc_score function. This is how we want the pipeline to understand our goal: ideally, a high score on that specific "metric", if that makes sense. However, after a few moments of reflection following your answer, I am thinking that we cannot actually do that by changing the metric. Rather, I should probably adjust the evaluation function, wouldn't you say?
A little context on why we want to pursue it this way: because we are dealing with a severely imbalanced outcome, we will not concentrate on the negative class AUROC value, but rather on the AUROC of the positive class, which has a severely skewed ratio in the dataset. This positive class is label 0, which I understand is not conventional, but that is how it sometimes is in the medical field ^^.
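For concreteness, the usual computation I am describing looks roughly like this (a toy sketch with made-up numbers standing in for a fitted classifier's predict_proba output, and with the indicator y_test == 0 used so that label 0 is treated as the positive class):

import numpy as np
import sklearn.metrics

# Made-up ground truth and predicted probabilities; column 0 holds P(class 0)
y_test = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([[0.9, 0.1],
                  [0.7, 0.3],
                  [0.2, 0.8],
                  [0.4, 0.6],
                  [0.6, 0.4],
                  [0.1, 0.9]])

# Take the class-0 column and score class 0 as the positive class
auroc_class_0 = sklearn.metrics.roc_auc_score(y_test == 0, proba[:, 0])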
Thanks again for your assistance Eddie; you have made me realise that I may not be able to do that via the metric we set on the pipeline. I will be awaiting your thoughts, though, to be sure.
Cheers,
So it seems like this is an unbalanced binary classification problem with the positive class being 0?
Assuming that:
The simplest solution would be to flip the labels in that case, before reaching autosklearn, and then use whatever metric you need. That way it conforms to the usual notion in binary classification where 1 is the positive label.
If that's not possible for whatever reason, then I would imagine you have seen the various metrics surrounding a ConfusionMatrix, which can often capture the kind of information you care about when considering highly imbalanced datasets. I have some vague memories from my college days of a lot of mention of minimizing the false_negative_rate over accuracy for cancer detection; you really don't want to accidentally tell someone they're safe when they're not. Apologies if you're already aware of these metrics, but given the assumptions above, they seem like the kind of metrics you would be interested in. You can get the components for all of them from sklearn.metrics.confusion_matrix:
from autosklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix

def _mymetric(y_true, y_pred):
    # Invert labels so 0 is the positive class
    # (only works for binary classification)
    flipped_true = ~y_true.astype(bool)
    flipped_pred = ~y_pred.astype(bool)
    tn, fp, fn, tp = confusion_matrix(flipped_true, flipped_pred).ravel()
    # Return whichever one of them you care about
    return ...

make_scorer(
    name="mymetric",
    score_func=_mymetric,
    # ... see args and fill in as required
)
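For instance, if the quantity of interest were the false_negative_rate mentioned above, a hedged completion of that sketch might look like the following (the optimum and worst_possible_result values are my assumptions, so double-check them against the make_scorer signature; note greater_is_better=False, since a lower false negative rate is better):

from autosklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix

def _false_negative_rate(y_true, y_pred):
    # Flip so that label 0 becomes the positive class
    flipped_true = ~y_true.astype(bool)
    flipped_pred = ~y_pred.astype(bool)
    tn, fp, fn, tp = confusion_matrix(flipped_true, flipped_pred).ravel()
    # Fraction of actual positives that the model missed
    return fn / (fn + tp)

fnr_scorer = make_scorer(
    name="false_negative_rate",
    score_func=_false_negative_rate,
    optimum=0,
    worst_possible_result=1,
    greater_is_better=False,
)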
If I'm wrong on the assumptions, you could provide a minimal y, y_true and the expected output that captures what you need, and I can maybe point you in the right direction.
sklearn metrics allow a sample_weights and we expose that as a parameter in autosklearn.metrics.Scorer, but we do not have any way to pass that through the pipeline, i.e. autosklearn can not optimize for these sample_weights.
To create your own weights based on classes, you should use a custom metric and define the weight per label yourself. An example of creating your own metric is shown in this example.
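A hedged sketch of that idea, assuming sklearn's f1_score and purely illustrative per-label weights (the weight values and the choice of class 0 as the heavier class are placeholders):

import numpy as np
from sklearn.metrics import f1_score
from autosklearn.metrics import make_scorer

def _class_weighted_f1(y_true, y_pred):
    # Hypothetical per-label weights; class 0 weighted more heavily
    class_weights = {0: 5.0, 1: 1.0}
    sample_weight = np.array([class_weights[label] for label in y_true])
    return f1_score(y_true, y_pred, sample_weight=sample_weight)

class_weighted_f1 = make_scorer(name="class_weighted_f1", score_func=_class_weighted_f1)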
Best, Eddie
Hi @eddiebergman, That is a priceless response! Thank you so much, that might benefit more than one person in the community ☀️
If you do not mind, I will return to some of your points and ask some questions that immediately came to mind after your answer:
The simplest solution would be to flip the labels in that case, before reaching autosklearn and then use whatever metric you need. That way it conforms to the usual notion in binary classification where 1 is the positive label.
I doubt that this is a major concern from a medical standpoint, but I would prefer to maintain the status quo if you can confirm that it will not result in additional pipeline optimisation issues.
If that's not possible for whatever reason, then I would imagine you have seen the various metrics surrounding a ConfusionMatrix which can often capture the kind of information you care about when considering highly imbalanced datasets. [...]
The concept of flipping y_true and y_pred was clever, thanks! Simple and reliable! Following this line of thought, I would like to pose the following question:
from sklearn.metrics import f1_score

def _mymetric(y_true, y_pred):
    labels = [0, 1]
    flipped_true = ~y_true.astype(bool)
    flipped_pred = ~y_pred.astype(bool)
    f1_scores = f1_score(flipped_true, flipped_pred, average=None, labels=labels)
    f1_scores_with_labels = dict(zip(labels, f1_scores))
    return f1_scores_with_labels[1]
As a result of flipping y_true and y_pred, I am calling the f1_score function and taking label 1, which is my previous label 0 (the positive class), correct?
from sklearn.metrics import auc, precision_recall_curve

def _mymetric(y_true, y_pred):
    flipped_true = ~y_true.astype(bool)
    # y_pred are probabilities (of class 1), so flip them with 1 - p
    flipped_pred = 1 - y_pred
    precision, recall, thresholds = precision_recall_curve(flipped_true, flipped_pred)
    return auc(recall, precision)
As a result of flipping y_true and y_pred, I am utilising the precision_recall_curve function and taking label 1, which was previously label 0 (the positive class), correct?
I will stop there, as I believe I have asked everything I wanted to about the custom metric. I am confident that it will help others in the community working with these particular measures regardless.
Thank you for your understanding Eddie, Best wishes
Ello again,
So I don't want to confirm or deny that your metrics make sense; metrics are one of the things AutoML can't control because they are very much a user-defined objective. For clarification on how it affects optimization:
In light of that, I think the best way going forward is for you to come up with sample predictions and sample output and see if the metric you made makes sense and is correct for your application. You can always wrap your _mymetrics in autosklearn.metrics.make_scorer and then try it out and see if it does what you intended!
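A quick way to do that check (a sketch, assuming the label-based _mymetric from your earlier snippet, and that the resulting Scorer is callable with (y_true, y_pred) as in the custom metrics example):

import numpy as np
from autosklearn.metrics import make_scorer

my_scorer = make_scorer(name="mymetric", score_func=_mymetric)

# Toy predictions to sanity-check the metric's behaviour
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])
print(my_scorer(y_true, y_pred))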
Best, Eddie
@eddiebergman Hello Again,
How the pipeline uses this custom metric is crystal clear now. Thank you for the details. However, something quite intriguing: would it make sense for a custom metric that returns the value of one class, such as the F-measure of the positive class, to produce a large difference between the positive and negative class results in, for example, the classification report metrics (precision/recall/f1) of the best model?
I attempted to develop a custom metric for the F-measure of the positive class, and the pipeline discovered an algorithm (produced a model) that yielded excellent results for the positive class but nearly zero results for the negative class (overfitting the positive class, actually). Would a (large) discrepancy of this magnitude make sense to you?
E.g., compared to an F-measure over both classes, such as F1-macro, the distinction is evident. F1-macro will produce a model whose per-class F1 results are comparable, whereas the F-measure of the positive class alone will reveal a significant difference between those classes. I am currently doing a lengthy run with the F-measure macro so that I can create a baseline against the F-measure of the positive class and provide precise outputs. Nonetheless, even without completing the run, I am confident that the F-measure macro will be superior, which is quite bothersome given that I want the best model to be evaluated based on the F-measure of the positive class (imbalanced binary classification).
Cheers,
Hey @simonprovost,
Unfortunately I can't give good advice here, as metrics are something AutoML isn't really aimed at solving; each problem will have its own issues with metrics. I will point out the scoring_functions argument to the estimators, i.e. AutoSklearnClassifier, which can help surface some other information beyond the optimization metric.
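A hedged sketch of that argument (the time budget and the particular metrics are placeholders; if I remember correctly, the extra scores then show up in automl.cv_results_ after fitting):

import autosklearn.classification
import autosklearn.metrics

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    # Optimize one metric...
    metric=autosklearn.metrics.f1_macro,
    # ...but also record these for every evaluated model
    scoring_functions=[autosklearn.metrics.roc_auc, autosklearn.metrics.balanced_accuracy],
)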
I would probably close the issue here if you don't mind, as I believe these questions really go more into the realm of evaluating a model for your own use case, and while interesting, I'm not really equipped to give good advice, especially if it's to be used for medical research purposes.
Best, Eddie
Dear Sir or Madam,
Is an autosklearn.metrics.roc_auc WEIGHTED metric available/possible?
How did this question come about? I am searching for a way to have the system optimise an autosklearn.metrics.roc_auc WEIGHTED metric, or an autosklearn.metrics.roc_auc metric for class label 0 (in a binary use case). Is either of the above feasible? Class label 1 is assumed to be the positive class for this metric by default, correct? So what if I am looking to optimise roc_auc for class label 0 (the positive class in my use case), and/or optimise the weighted option of the roc_auc metric?
What have you already looked at? I, unfortunately, don't see any here: https://automl.github.io/auto-sklearn/master/api.html?highlight=scoring_functions#built-in-metrics
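For completeness, and going by the discussion above, a weighted variant could presumably be built with make_scorer, assuming extra keyword arguments are forwarded to the underlying score function (a sketch, not a built-in):

import sklearn.metrics
from autosklearn.metrics import make_scorer

# Hypothetical weighted ROC AUC scorer; average="weighted" would be forwarded
# to sklearn.metrics.roc_auc_score (mainly relevant for multiclass/multilabel)
roc_auc_weighted = make_scorer(
    name="roc_auc_weighted",
    score_func=sklearn.metrics.roc_auc_score,
    needs_threshold=True,
    average="weighted",
)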