Correlation metrics not working as expected for binary classification

desilinguist commented 5 years ago

Back when we wrote an early version of SKLL, we wanted a way to use correlation metrics as tuning objectives for binary classification problems in such a way that we could use the probability of the positive class as the input to the correlation function rather than just the label itself. This should yield a more informative value for the objective.

The way we achieved this was as follows: before starting the grid search, if we saw that we had a correlation metric and a classifier, we monkey-patched the estimator being passed to the GridSearchCV instance, replacing its predict() method with the custom _predict_binary() function. When this function is called, it checks whether the classification task is binary and, if so, uses the probability of the positive class as the prediction. If it's not binary, it outputs the most likely label as the prediction.
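For anyone unfamiliar with the pattern, here is a minimal sketch of that kind of monkey-patching. It is not SKLL's actual code: `ToyBinaryClassifier` and `predict_binary` are hypothetical stand-ins, and the toy classifier simply treats its single feature as the positive-class probability.

```python
import types

import numpy as np


class ToyBinaryClassifier:
    """Hypothetical stand-in for a scikit-learn style classifier."""

    classes_ = np.array([0, 1])

    def predict_proba(self, X):
        # Toy behavior: the single feature is the positive-class probability.
        p_pos = np.clip(np.asarray(X, dtype=float).ravel(), 0.0, 1.0)
        return np.column_stack([1.0 - p_pos, p_pos])

    def predict(self, X):
        # Default behavior: return the most likely label.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


def predict_binary(self, X):
    """Illustrative replacement for predict(); an assumption, not SKLL code."""
    probabilities = self.predict_proba(X)
    if probabilities.shape[1] == 2:
        # Binary case: use the positive-class probability as the prediction.
        return probabilities[:, 1]
    # Non-binary case: fall back to the most likely label.
    return self.classes_[np.argmax(probabilities, axis=1)]


clf = ToyBinaryClassifier()
X = [[0.2], [0.9], [0.6]]
labels_before = clf.predict(X)  # most likely labels

# Monkey-patch this one instance so predict() returns probabilities instead.
clf.predict = types.MethodType(predict_binary, clf)
scores_after = clf.predict(X)  # positive-class probabilities
```

Note that a patch like this lives on the single instance it was applied to; any other instance of the class still has the original predict().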

However, this monkey-patching is no longer working, as this notebook shows. Basically, even in the binary case, the most likely label is being used when computing pearson.

To fix this, we should no longer need this monkey-patching. We can simply do the following:

If we see that the probability option is enabled and that the objective is, say, 'pearson', we dynamically update SCORERS['pearson'] to be make_scorer(pearson, needs_proba=True) rather than make_scorer(pearson), which is the default value. This way, our pearson() function in metrics.py will receive probabilities as the estimator predictions if the probability configuration option is enabled, and the most likely labels otherwise.
In the pearson() function itself, we first check whether the array of predictions is 1-dimensional (which will be the case for binary classification, since that's how needs_proba works). If it is, we just compute the pearson correlation of the predictions with the true labels. However, if the array of predictions has dimension > 1 (which will be the case for multi-class classification), we pick the most likely label and use that to compute the pearson correlation with the true labels. The latter is because it doesn't make sense to use the probabilities in the multi-class case.
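The dimension check described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in metrics.py: `pearson_sketch` is a hypothetical name, it uses `np.corrcoef` for the correlation, and it assumes the class labels are the integers 0..k-1 in the same order as the probability columns. The scorer registration is left out, since the make_scorer arguments depend on the scikit-learn version in use.

```python
import numpy as np


def pearson_sketch(y_true, y_pred):
    """Pearson correlation that accepts labels, 1-D scores, or 2-D probabilities.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim > 1:
        # Multi-class probabilities: reduce each row to its most likely label.
        # Assumes labels are 0..k-1 and match the column order.
        y_pred = np.argmax(y_pred, axis=1)
    # 1-D input (labels, positive-class probabilities, or regressor output)
    # is correlated with the true values directly.
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Binary case: positive-class probabilities vs. hard labels.
y_true_binary = [0, 1, 1, 0]
r_from_probabilities = pearson_sketch(y_true_binary, [0.1, 0.8, 0.6, 0.4])
r_from_labels = pearson_sketch(y_true_binary, [0, 1, 1, 0])

# Multi-class case: a 2-D array of per-class probabilities.
y_true_multi = [0, 1, 2, 1]
multi_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
]
r_multi = pearson_sketch(y_true_multi, multi_probabilities)
```

In the binary example, the hard labels match the truth exactly, so they correlate perfectly, while the probabilities give a value below 1 that reflects how confident each prediction was. That is the extra information the objective is meant to pick up.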

With these modifications, all use cases with pearson as the objective function will work seamlessly for the user:

Case 1: binary classification, `probability` option enabled.

before grid-search: pearson function gets dynamically updated to receive probabilities.
during grid-search: pearson function receives the probabilities of the positive class and computes correlation values as expected.

Case 2: binary classification, `probability` option not enabled.

before grid-search: pearson function continues to receive only the most likely labels.
during grid-search: pearson function receives the most likely labels and computes correlation values as expected.

Case 3: multi-class classification, `probability` option enabled.

before grid-search: pearson function gets dynamically updated to receive probabilities.
during grid-search: pearson function receives a multi-dimensional array of probabilities, infers the most likely label from that, and computes correlation values as expected.

Case 4: multi-class classification, `probability` option not enabled.

before grid-search: pearson function continues to receive the most likely labels.
during grid-search: pearson function receives the most likely labels and computes correlation values as expected.

Case 5: regression

the probability option does not matter at all since it's a regression.
before grid-search: pearson function continues to receive the regressor predictions.
during grid-search: pearson function receives the regressor predictions and computes correlation values as expected.

We can do the same for the other two correlation metrics as well (spearman and kendalltau).
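To make the five cases concrete, the decision taken before the grid search can be written as a single predicate. This is only a sketch: `scorer_needs_probabilities` and `CORRELATION_OBJECTIVES` are hypothetical names, the metric names are the ones used in this thread, and the sketch only covers the correlation objectives discussed here.

```python
# Metric names as used in this thread; treat the exact strings as assumptions.
CORRELATION_OBJECTIVES = {"pearson", "spearman", "kendalltau"}


def scorer_needs_probabilities(objective, probability, is_classifier):
    """Return True if the scorer should be rebuilt to receive probabilities.

    Hypothetical helper: True only for a classifier with the probability
    option enabled and a correlation objective (Cases 1 and 3 above).
    """
    return bool(
        is_classifier and probability and objective in CORRELATION_OBJECTIVES
    )


# Cases 1-5 from the list above, in order. Whether the task is binary or
# multi-class does not change this decision; that distinction is handled
# inside the metric function itself.
case_results = [
    scorer_needs_probabilities("pearson", probability=True, is_classifier=True),
    scorer_needs_probabilities("pearson", probability=False, is_classifier=True),
    scorer_needs_probabilities("pearson", probability=True, is_classifier=True),
    scorer_needs_probabilities("pearson", probability=False, is_classifier=True),
    scorer_needs_probabilities("pearson", probability=True, is_classifier=False),
]
```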

@aoifecahill @mulhod @bndgyawali @jbiggsets does all of this make sense?

mulhod commented 5 years ago

I think this makes sense. It's not working as it is and it's hacky. Your fix seems like it will fix both of those issues.

mulhod commented 5 years ago

Could we include basically the test you have written to ensure that future changes don't break this functionality?

desilinguist commented 5 years ago

Yup, that's my plan!

desilinguist commented 5 years ago

Addressed by #551

EducationalTestingService / skll