No, I think it's correct as it is, though it might be a bit confusing. Taking a look here (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), the difference between predict and predict_proba is that predict produces a list like [0, 1, 2, 0, 2, 1, ...], whereas predict_proba provides more information, e.g. [[0.9, 0.1, 0.0], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6], [1.0, 0.0, 0.0], [0.1, 0.3, 0.6], [0.1, 0.6, 0.3], ...]. The point is that, when we are using a probabilistic model, we want the evaluation function to have access to all of that information, rather than just the probability of a single class, which is what returning one column would give. Especially if we have a 3-class problem, we need the full array rather than just one column.
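Concretely, the two output shapes look something like this on a toy 3-class problem (the data and model here are illustrative only, not from the package):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 3-class data, purely for illustration
X = np.random.rand(30, 4)
y = np.random.randint(0, 3, size=30)

model = SVC(probability=True).fit(X, y)

labels = model.predict(X)        # shape (30,): one class label per sample, e.g. [0, 1, 2, ...]
probs = model.predict_proba(X)   # shape (30, 3): one probability per class per sample
```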
If you take a look at the metrics which I provide, you'll notice that I handle both possibilities here in this function https://github.com/gelijergensen/PermutationImportance/blob/40737540bba9fe284466fb3e652a4f21dd42e1d5/PermutationImportance/metrics.py#L64.
The reason I raise the issue is that, for evaluating the AUC (using sklearn.metrics.roc_auc_score), an error was raised, since the predictions need to be a 1-d vector (e.g., just the probability of the positive class).
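For concreteness, here's a minimal sketch of what goes wrong (toy binary data, with LogisticRegression standing in for my actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_au_score if False else roc_auc_score  # noqa: standard import

# Toy binary data, purely for illustration
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, size=50)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)           # shape (50, 2)
# roc_auc_score(y, probs)                # raises: the binary case expects a 1-d score vector
auc = roc_auc_score(y, probs[:, 1])      # works: probability of the positive class only
```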
I see. And are you wanting to use it with a model which has a predict_proba method or just a predict method? In the latter case, you can just use the model the other way (i.e. the "deterministic" way), because it'll call predict under the hood and then pass that 1-d vector along to the AUC scoring function. That's why having called it "deterministic" vs. "probabilistic" was probably a bit of a misnomer on my part. The model itself can produce 2-class probabilistic outputs and it'll technically be a "deterministic" model as far as the wrapper methods here are concerned.
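Purely to illustrate that distinction (hypothetical helper names here, not the actual code in this package), the two paths differ only in which prediction method feeds the evaluation function:

```python
# Hypothetical sketch of the two code paths described above;
# not the actual PermutationImportance implementation.

def score_deterministic(model, scoring_inputs, scoring_outputs, evaluation_fn):
    # "Deterministic": predict() yields a 1-d vector of class labels,
    # which is passed straight to the evaluation function.
    return evaluation_fn(scoring_outputs, model.predict(scoring_inputs))

def score_probabilistic(model, scoring_inputs, scoring_outputs, evaluation_fn):
    # "Probabilistic": predict_proba() yields an (n_samples, n_classes) array,
    # so the evaluation function sees every class's probability.
    return evaluation_fn(scoring_outputs, model.predict_proba(scoring_inputs))
```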
If your model is using a predict_proba method, then we'll need to make a little wrapper around sklearn.metrics.roc_auc_score which grabs only the column that you care about.
Sounds good. The primary issue I was encountering was that predictions from the predict method produce an erroneously low AUC compared to predictions from predict_proba[:, 1].
Oh, hmm... that is weird. In any case, you could try using this wrapper function around sklearn.metrics.roc_auc_score:
```python
def grab_single_class_probabilities(f):
    # Pass only the second column (the positive class) of the truths and
    # predictions to the wrapped metric; assumes both are 2-d arrays.
    def wrapped_function(truths, predictions):
        return f(truths[:, 1], predictions[:, 1])
    return wrapped_function
```
And then you can simply write grab_single_class_probabilities(roc_auc_score) instead of roc_auc_score, or something along those lines. It's maybe not the cleanest (and you may not need to modify the truths as well as the predictions in the internal call), but I think it should work and would allow you to see if the AUC is still oddly low.
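A quick usage sketch of that wrapper (assuming the function above is defined and the truths are supplied as a 2-d one-hot array; if your truths are already 1-d labels, drop the truths[:, 1] indexing as noted above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative data: one-hot truths and predicted class probabilities
truths = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
predictions = np.array([[0.8, 0.2], [0.3, 0.7], [0.4, 0.6], [0.9, 0.1]])

wrapped_auc = grab_single_class_probabilities(roc_auc_score)
print(wrapped_auc(truths, predictions))  # AUC computed on the positive-class columns
```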
On line 56 in sklearn_api.py, you have the following:

```python
return model.predict_proba(scoring_inputs)
```

but you may want:

```python
return model.predict_proba(scoring_inputs)[:, 1]
```

so that the evaluation function receives probabilities for just the positive class.