ggjj11 opened this issue 2 years ago
Hi @ggjj11,
So I did some digging and I honestly couldn't find any point at which we deal with this threshold, other than this line where predictions are returned to the end user: https://github.com/automl/auto-sklearn/blob/af9d46983c4680b710c79c7714ed0047077d02dc/autosklearn/automl.py#L2355
Edit: Found it in metrics/__init__.py
https://github.com/automl/auto-sklearn/blob/af9d46983c4680b710c79c7714ed0047077d02dc/autosklearn/metrics/__init__.py#L106-L110
As for how the ensembling is done, it uses the raw probabilities, which are then converted to labels by the metric in the lines above, calculating a final score that decides whether that model is added to the ensemble.
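Roughly speaking (an illustrative paraphrase, not the linked code itself), the conversion amounts to turning class probabilities into hard labels before a label-based metric is computed, which for binary problems coincides with a 0.5 cutoff:

import numpy as np

proba = np.array([[0.45, 0.55],    # (n_samples, n_classes) predicted probabilities
                  [0.80, 0.20]])
labels = np.argmax(proba, axis=1)  # binary case: equivalent to (proba[:, 1] > 0.5)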
So to answer your question: no, we don't consider it as a hyperparameter, and to do so would require quite a lot of changes. In general, the classifiers we produce are not well calibrated with respect to their probabilities, so trying to tune this threshold as a hyperparameter seems like it would lead to a lot of overfitting.
For example, consider a well calibrated classifier that outputs calibrated probabilities, i.e. 0.9 means it is pretty confident of a 1, 0.7 means it is less confident about a 1, and 0.5 means it really can't tell between 0 and 1. Contrast this to an uncalibrated classifier which gives 0.51, translating to a label of 1, but where the 0.51 doesn't really mean it's any more confident than when it gives a probability of 0.8 or 0.9.
If we were to tune the threshold to something like 0.7, we would end up with an ensemble that only includes genuinely confident calibrated classifiers plus some arbitrary subsample of uncalibrated classifiers. As we push this threshold higher, the proportion of uncalibrated classifiers would increase, meaning we reduce diversity to just these classifiers and overfit on them.
To answer fully: this is a very good thought and thank you for bringing it up. I don't think it is worth doing at this time, as we don't explicitly calibrate classifiers. However, maybe that's something we should consider: having only calibrated classifiers would mean we select only "confident" classifiers for the final ensemble and could use the threshold hyperparameter effectively.
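For reference, a minimal sketch of what explicit calibration could look like with sklearn (CalibratedClassifierCV is real sklearn API; the choice of base model and data here is purely illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
# Platt scaling ("sigmoid") fitted via internal cross-validation; "isotonic"
# is the non-parametric alternative.
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="sigmoid", cv=5).fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated probability of class 1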
@mfeurer tagging as you may be interested in this.
Best, Eddie
To follow up on @eddiebergman and give some more details
Thanks for bringing this up!
Thank you both for your clarifications about the usage of the threshold in auto-sklearn and how the ensemble is built in detail!
Ensemble selection directly optimizes towards improving the score under the 0.5 threshold
This is interesting since, for a certain user-specified metric like e.g. the fscore, only some of the trained models are optimal at the 0.5 threshold. Could it be that if the threshold is also allowed to be optimized (per model), then more models become suitable for use in the ensemble? If yes, then this could a) increase the performance with regard to the chosen metric and b) reduce the needed runtime.
PS: For other people reading this question https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/ might be interesting for understanding the importance of the decision threshold parameter.
PPS: I quickly searched (without further relation to ensemble methods) and found mlr, which seems to at least tune the threshold for normal classifiers: https://github.com/mlr-org/mlr/issues/856. Potentially this is off topic though...
Hi @ggjj11,
Sorry for the delayed response, but this seems like a very nice issue to keep open for future work; we may look into it if we get some extra manpower interested in investigating this. Also, I might tag @LennartPurucker as he is doing work on ensemble construction and this may be an interesting angle to look at things from.
Best, Eddie
Hi @eddiebergman,
Thank you very much for adding me! This is indeed a very interesting topic for ensemble building based on probabilities.
Hi @ggjj11,
We can integrate threshold-tuning into the greedy Ensemble Selection (ES) algorithm by Caruana et al. that was mentioned by mfeurer. Furthermore, we can handle most of this by only changing the implementation of the ensemble method.
I tested ES + threshold-tuning on two imbalanced datasets by adding a very basic greedy search for the global best threshold (i.e., not per individual model but per evaluated ensemble) to greedy ensemble selection.
For the adult dataset (class ratio = 0.31)
For the APSFailure dataset (class ratio = 0.02)
The main drawback of my basic implementation is that it increases the runtime by a factor of the number of evaluated thresholds. I think this could be a useful extension, but more sophisticated integrations and tests are needed. In my opinion, it would be most interesting to find out how we can adjust the threshold per base model while integrating it into the ensemble.
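A rough sketch of what such a basic extension could look like (illustrative only, not the implementation used for the numbers above; balanced accuracy stands in for the metric, and probas / y_val are assumed held-out validation data):

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def greedy_es_with_threshold(probas, y_val, n_iter=20,
                             thresholds=np.arange(0.05, 1.0, 0.05)):
    # probas: list of per-model class-1 probabilities on the validation set.
    # Greedily grow the ensemble (with replacement); every candidate ensemble
    # is scored at every threshold and the globally best threshold is kept.
    ensemble, best = [], (-np.inf, None, None)  # (score, member indices, threshold)
    for _ in range(n_iter):
        step_best = (-np.inf, None, None)
        for i, p in enumerate(probas):
            avg = np.mean([probas[j] for j in ensemble] + [p], axis=0)
            for t in thresholds:  # extra cost: a factor of len(thresholds)
                score = balanced_accuracy_score(y_val, (avg >= t).astype(int))
                if score > step_best[0]:
                    step_best = (score, i, t)
        ensemble.append(step_best[1])
        if step_best[0] > best[0]:
            best = (step_best[0], list(ensemble), step_best[2])
    return best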
Thanks for bringing this up!
Nice to read this open discussion and to see benchmarks for the idea.
@LennartPurucker, picking up your question on how to deal with model-dependent thresholds, I had the following thoughts.
Having N trained models, we search for an optimal decision threshold $t_i$ for each model i. The usual rule, which is also applied in sklearn, is that model_i(x) < t_i (with the standard, fixed t_i = 0.5) indicates class 0. Allowing for a tunable $t_i$ could be reformulated (while retaining the decision level at 0.5) by allowing for an offset in the predicted model probability: model_i.predict_proba(x) -> np.clip(model_i.predict_proba(x) + c_i, 0, 1), where c_i is a model-specific offset to be tuned.
[Instead of the clip function (which piles up probability mass at the boundary of the [0, 1] range), a redistribution of probability by squishing or stretching the predicted probabilities could be beneficial. For simplicity, let me consider c_i > 0 (c_i < 0 works the same way):
import numpy as np

def shift_proba(p, c_i):
    # p: predicted probability of class 1, e.g. model_i.predict_proba(x)[:, 1]
    y = np.asarray(p) + c_i  # e.g. c_i = 0.2
    # e.g. y = 1.2: squish the shifted range [0.5, 1 + c_i] back into [0.5, 1]
    upper = 0.5 + (y - 0.5) / (0.5 + c_i) * 0.5
    # e.g. y = 0.2: stretch the shifted range [c_i, 0.5] back into [0, 0.5]
    lower = 0.5 + (y - 0.5) / (0.5 - c_i) * 0.5
    return np.where(y >= 0.5, upper, lower)
I guess this will work better for building ensembles than clipping. ]
Example: if model_i would give the optimal fscore at threshold t_i = 0.7, then c_i could be chosen as -0.2 while retaining the decision level at 0.5: under the optimal classification rule, an example with predicted probability 0.69 would be classified as 0. Computing 0.69 - 0.2 = 0.49 yields the same classification while still using the 0.5 threshold.
The optimal shift c_i would depend on the metric like fscore, precision, recall, ... as well as the model.
To keep compute resources low when tuning c_i per model, we can leverage the fact that the model does not have to be retrained. Given the predicted probabilities and the true class labels y, the optimization of c_i would go like this:
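A minimal sketch of that search (the metric f1_score, the offset grid, and the names proba_val / y_val are illustrative assumptions; for the label decision, shifting by c_i and cutting at 0.5 is equivalent to cutting the raw probability at 0.5 - c_i, so the squish/stretch above only matters once probabilities are averaged in the ensemble):

import numpy as np
from sklearn.metrics import f1_score

def tune_offset(proba_val, y_val, metric=f1_score,
                grid=np.linspace(-0.4, 0.4, 81)):
    # Grid-search the per-model offset c_i on held-out predictions only;
    # the base model is never retrained.
    best_c, best_score = 0.0, -np.inf
    for c in grid:
        labels = (np.asarray(proba_val) + c >= 0.5).astype(int)
        score = metric(y_val, labels)
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score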
@jonaslandsgesell thank you for the detailed comment and input on this discussion.
Your approach is very reasonable, and it is precisely what I was asking about! Originally, I thought about it from a weights perspective in Ensemble Selection (ES) instead of re-calibrating the probabilities directly based on c_i. Nevertheless, I think not using the weights is more reasonable and agnostic to different ensemble methods. It could function (similar to other probability calibration methods) as a preprocessing step for training ensembles.
To find the best classification threshold automatically, you can try various thresholds for the predicted class probability, say from 0.01 to 1.00 (that's 100 values to try), and then choose the threshold that maximizes MCC (Matthews Correlation Coefficient). Indeed, having a class probability is enough to rank-order candidates. But if you are asked to predict labels, you do need to optimize the classification threshold. MCC is robust to class imbalance, which is why I like it.
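A small sketch of that sweep (the function name and inputs are illustrative; matthews_corrcoef is real sklearn API):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_mcc_threshold(y_true, proba_pos):
    # Try thresholds 0.01, 0.02, ..., 1.00 and keep the one maximizing MCC.
    thresholds = np.arange(0.01, 1.005, 0.01)
    scores = [matthews_corrcoef(y_true, (proba_pos >= t).astype(int))
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]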
The pull request https://github.com/scikit-learn/scikit-learn/pull/16525 might implement a calibration method for tuning thresholds in sklearn. Would that be applicable for usage in Auto-Sklearn?
Short Question Description
Is the auto-sklearn ensemble for classification using predict_proba + threshold adaptations?
I read the original paper and read about the unique way auto-sklearn makes use of a "metric" to build an ensemble from trained models. These models were trained during hyperparameter optimization and are therefore rather diverse. When selecting e.g. recall/precision/fscore/... as the metric for the final ensemble, several of these explored models are added to the ensemble based on their performance on a validation set.
Now recall/precision/fscore... depend on the decision threshold (which is typically assumed to be 0.5).
Is auto-sklearn making use of e.g. the precision-recall curves of the classifiers already trained during hyperparameter optimization to identify the best-performing models?
The threshold is not a typical hyperparameter like e.g. the depth of trees (in decision tree classifiers), but rather a hyperparameter of a more subtle kind. Let me still call it the "threshold hyperparameter" because it affects prediction results. Is this threshold hyperparameter considered when building the best final ensemble? I could not find documentation about how exactly the ensemble building takes place and whether the threshold is considered as some pseudo-hyperparameter for classification. In any case, it would be rather cheap to check whether e.g. recall can be improved for a base classifier when we change the decision threshold.
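For instance, a cheap check along these lines (illustrative names; precision_recall_curve is real sklearn API) could recover the F-score at every candidate threshold from a single set of held-out probabilities:

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_fscore_threshold(y_true, proba_pos):
    # Evaluate F1 at every threshold implied by the predictions; no retraining.
    precision, recall, thresholds = precision_recall_curve(y_true, proba_pos)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1[:-1]))  # the last precision/recall point has no threshold
    return thresholds[best], f1[best]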
Thank you in advance for the great software!