Closed desilinguist closed 5 years ago
I think this makes sense. It's not working as it is and it's hacky. Your fix seems like it will fix both of those issues.
Could we include basically the test you have written to ensure that future changes don't break this functionality?
Yup, that's my plan!
Addressed by #551
Back when we wrote an early version of SKLL, we wanted a way to use correlation metrics as tuning objectives for binary classification problems in such a way that we could use the probability of the positive class as the input to the correlation function rather than just the label itself. This should yield a more informative value for the objective.
The way we achieved this was as follows: before starting the grid search, if we saw that we had a correlation metric and a classifier, we would monkey-patch the estimator being passed to the `GridSearchCV` instance, replacing its `predict()` method with the `custom_predict_binary()` function. When this function is called, it checks whether the classification being performed is binary; if so, it uses the probability of the positive class as the prediction, and if not, it outputs the most likely label as the prediction.

However, this monkey-patching is no longer working, as this notebook shows. Basically, even for the binary case, the most likely label is being used when computing `pearson`.
To fix this, we should no longer need this monkey-patching. We can simply do the following:
If we see that the `probability` option is enabled and that the objective is, say, `'pearson'`, we dynamically update `SCORERS['pearson']` to be `make_scorer(pearson, needs_proba=True)` rather than the default `make_scorer(pearson)`. This makes it so that our `pearson()` function in `metrics.py` receives probabilities as the estimator predictions if the `probability` configuration option is enabled, and the most likely labels otherwise.

In the `pearson()` function itself, we first check whether the array of predictions is 1-dimensional (which will be the case for binary classification, since that's how `needs_proba` works); if so, we simply compute the Pearson correlation of the predictions with the true labels. However, if the array of predictions has more than one dimension (which will be the case for multi-class classification), we pick the most likely label and use that to compute the Pearson correlation with the true labels. The latter is because it doesn't make sense to use the probabilities in the multi-class case.

With these modifications, all use cases with `pearson` as the objective function will work seamlessly for the user:
- **Case 1**: binary classification, `probability` option enabled. The `pearson` scorer gets dynamically updated to receive probabilities, so the function receives the probabilities of the positive class and computes correlation values as expected.
- **Case 2**: binary classification, `probability` option not enabled. The `pearson` function continues to receive only the most likely labels and computes correlation values as expected.
- **Case 3**: multi-class classification, `probability` option enabled. The `pearson` scorer gets dynamically updated to receive probabilities, so the function receives a multi-dimensional array of probabilities, infers the most likely label from it, and computes correlation values as expected.
- **Case 4**: multi-class classification, `probability` option not enabled. The `pearson` function continues to receive the most likely labels and computes correlation values as expected.
- **Case 5**: regression. The `probability` option does not matter at all since it's a regression; the `pearson` function continues to receive the regressor predictions and computes correlation values as expected.

We can do the same for the other two correlation metrics as well (`spearman` and `kendalltau`).

@aoifecahill @mulhod @bndgyawali @jbiggsets does all of this make sense?