PGijsbers opened this issue 3 years ago
@PGijsbers Thank you for submitting this issue, for the detail, and for the minimally reproducible example.
It seems that this is an issue when a class is unobserved in one or more of the cross-validation folds that TPOT generates (by default, it uses StratifiedKFold with 5 folds to generate the cross-validation splits). sklearn's log_loss metric is then passed a probability array that is missing columns for one or more of the classes.
You can reduce the number of folds performed by TPOT so that it is less than the number of instances in the smallest class, or create your own cross-validation fold generator that ensures at least one instance of each class exists in the data passed when fitting the pipeline for scoring (both of these use the cv argument when instantiating TPOT). This issue does not arise with non-probability-based metrics (TPOT will only propagate the sklearn warnings from StratifiedKFold), as those handle missing classes appropriately.
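As an illustration of the cv route, here is a minimal sketch (the data, the small_class_size variable, and the fold count are made up for the example; scoring, cv, and max_time_mins are the usual TPOTClassifier arguments):

from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

# Use no more splits than the size of the smallest class, so every training
# split contains at least one instance of every class. Note that this only
# helps when the rarest class has at least two instances.
small_class_size = 2  # hypothetical: size of the rarest class in your data
cv = StratifiedKFold(n_splits=min(5, small_class_size), shuffle=True, random_state=0)

t = TPOTClassifier(max_time_mins=1, scoring="neg_log_loss", cv=cv)
# t.fit(x, y)  # x, y as in the demo further down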
This is an issue that occurs in native sklearn and is due to how log_loss handles missing classes (or a lack of handling thereof): see https://github.com/scikit-learn/scikit-learn/issues/11777 and https://github.com/scikit-learn/scikit-learn/issues/15389. We can attempt to write code to handle this on our end, but I'm of the opinion that it's better to leave it up to sklearn to correct these issues and handle this within their own scoring functions.
In theory, we could eliminate or ignore sparsely-populated classes, either in preprocessing or when evaluating pipelines. However, TPOT can otherwise handle cases like this and properly construct and mutate pipelines with most other metrics (for example, with plain accuracy), so this doesn't seem like the best approach to take without user input. It may be something better left to the user, as the approach to removing outliers or handling classes with few instances will likely differ significantly based on the meta-features of the input dataset.
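If a user does decide that dropping rare classes before optimization is acceptable for their problem, a minimal sketch of doing so outside of TPOT might look like the following (drop_rare_classes and min_count are made-up names; whether dropping or merging is appropriate is the user's call):

import numpy as np

def drop_rare_classes(X, y, min_count=5):
    # Keep only rows whose class has at least min_count instances.
    # Assumes X and y are NumPy arrays.
    classes, counts = np.unique(y, return_counts=True)
    keep = np.isin(y, classes[counts >= min_count])
    return X[keep], y[keep]

# X_sub, y_sub = drop_rare_classes(X, y, min_count=5)
# TPOTClassifier(...).fit(X_sub, y_sub)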
It is possible to handle this and use a larger number of folds, without modifying the functionality of TPOT or sklearn, while maintaining the use of the log_loss metric. One option is to write a custom log_loss metric that pads the reported probabilities with probabilities of 0 for the missing classes before passing them to sklearn's log_loss. I've written a demo of this below:
from tpot import TPOTClassifier
import numpy as np
from sklearn.metrics import log_loss, make_scorer

# Toy data with a single-instance class (class 2) to trigger the problem.
x, y = np.random.random((151, 4)), np.asarray([0] * 75 + [1] * 75 + [2])
labels = np.unique(y)

def mod_log_loss(y_true, y_pred, labels):
    # If the fitted pipeline reports fewer probability columns than there
    # are classes overall, pad the missing columns with probability 0.
    class_diff = len(labels) - len(y_pred[0])
    if class_diff > 0:
        y_pred_pad = np.array([np.pad(p, pad_width=(0, class_diff)) for p in y_pred])
    else:
        y_pred_pad = y_pred
    return log_loss(y_true, y_pred_pad, labels=labels)

mod_neg_log_loss = make_scorer(mod_log_loss, greater_is_better=False, labels=labels, needs_proba=True)

t = TPOTClassifier(max_time_mins=1, scoring=mod_neg_log_loss)
t.fit(x, y)
t.predict(x)
Note that this demo assumes that the missing classes are the last classes (it pads at the end of the probability vectors). In theory, you could instead determine which of the classes in labels are missing from the reported probabilities (y_true will be passed in from the cross-validation scoring) and insert the zero columns at those positions, so that the approach also works when the sparsely-populated classes are not the last classes in the dataset, though I have not tested this.
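One way to sketch that position-aware padding (not part of the demo above, and assuming your TPOT version also accepts a plain callable scorer with the scorer(estimator, X, y) signature, as sklearn's cross-validation utilities do) is to skip make_scorer and read the fitted pipeline's classes_ attribute directly; neg_log_loss_full is a hypothetical helper name:

import numpy as np
from sklearn.metrics import log_loss

# labels = np.unique(y), as computed in the demo above
def neg_log_loss_full(estimator, X, y_true):
    # predict_proba only has columns for the classes seen during training;
    # place each column at its true position and leave the rest at 0.
    proba = estimator.predict_proba(X)
    full = np.zeros((proba.shape[0], len(labels)))
    for col, cls in enumerate(estimator.classes_):
        full[:, np.searchsorted(labels, cls)] = proba[:, col]
    return -log_loss(y_true, full, labels=labels)

# t = TPOTClassifier(max_time_mins=1, scoring=neg_log_loss_full)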
Let us know if you have any thoughts or questions!
Thank you very much for the elaborate response. I was aware of the underlying issue, but I wasn't aware it was a design decision not to address it within TPOT. I understand the decision, feel free to close the issue if desired.
Admittedly, I'm not sure whether we should handle this case ourselves or rely on the user to know the drawbacks of imbalanced data and/or the limits of the metrics they choose. My reasoning is that processing the data in any way that isn't fully transparent to the user and consistent across all cases will be problematic, so it's better to leave it up to the user to decide how to handle the situation. There are many options, and the best one will likely depend on what the user knows about their data and the importance of the outlier class; for example, in biomedical data, imbalances are common but usually highly important, as when an extraordinarily rare disease with few cases is set against a large number of "control" cases.
That being said, I'll have to talk with the rest of the lab that supports TPOT to see what the best choice might be. Thank you for raising the issue! We'll keep it open for now while we think about the best way to handle this - we may need to be clearer about this in the documentation, or keep it in mind for future TPOT extensions/modifications.
Yes, I think it depends entirely on how hands-off you want the AutoML experience to be and what the expected data-science experience of the user is.
A TPOT.fit call may fail when there are outlier minority classes (with certain metrics).

Context of the issue
When running the benchmark we encountered this issue sometimes, for instance with evaluations on wine-quality-white: python runbenchmark.py TPOT openml/t/359974 1h8c -f 6. Because of TPOT internals, the small minority classes may cause an error when optimizing towards log loss. I reduced the issue to a minimal example:
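(The original minimal example did not survive formatting here; the following is only a reconstruction of its likely shape, based on the data used in the reply above, not the reporter's exact code.)

from tpot import TPOTClassifier
import numpy as np

# A dataset where one class has a single instance.
x, y = np.random.random((151, 4)), np.asarray([0] * 75 + [1] * 75 + [2])

t = TPOTClassifier(max_time_mins=1, scoring="neg_log_loss")
t.fit(x, y)  # fails inside cross-validation scoring with log loss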
Expected result
I expect a pipeline to be fit regardless, and to be able to produce predictions for every class (even if that means predicting a probability of zero and receiving a warning about it).
Current result
Running the MWE:
Possible fix
Depends on the level you want to fix it on; options include:
scikit-learn (… warnings, and also lead to the error)