gelijergensen / PermutationImportance

Python package for computing the importance of variables in a model through permutation selection

Predicting probabilities #83

Closed: monte-flora closed this issue 5 years ago

monte-flora commented 5 years ago

On line 56 of sklearn_api.py, you have the following: return model.predict_proba(scoring_inputs), but you may want return model.predict_proba(scoring_inputs)[:, 1] so that the evaluation function receives the probabilities of the positive class.

gelijergensen commented 5 years ago

No, I think it's correct as it is, though it might be a bit confusing. Taking a look here (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), the difference between predict and predict_proba is that predict produces a list of labels, e.g. [0, 1, 2, 0, 2, 1, ...], whereas predict_proba provides more information, with one probability per class, e.g. [[0.9, 0.1, 0.0], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6], [1.0, 0.0, 0.0], [0.1, 0.3, 0.6], [0.1, 0.6, 0.3], ...]. When we are using a probabilistic model, we want the evaluation function to have access to all of that information, rather than just the single column of probabilities that [:, 1] would return. Especially if we have a 3-class problem, we need the full matrix, not just one column.
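
For concreteness, here is a minimal sketch of that shape difference on a toy 3-class problem (the data and model here are just placeholders):

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 4)          # 100 examples, 4 features
y = np.random.randint(0, 3, 100)    # 3 classes

model = SVC(probability=True).fit(X, y)

print(model.predict(X).shape)        # (100,)   -> one label per example
print(model.predict_proba(X).shape)  # (100, 3) -> one probability per class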

If you take a look at the metrics I provide, you'll notice that I handle both possibilities in this function: https://github.com/gelijergensen/PermutationImportance/blob/40737540bba9fe284466fb3e652a4f21dd42e1d5/PermutationImportance/metrics.py#L64.
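
I won't paste the code from that link here, but the general pattern is just to check the dimensionality of the predictions before scoring. A rough sketch of the idea (not the actual metrics.py code):

import numpy as np

def sketch_score(truths, predictions):
    # Probabilistic models hand us a 2-d (n_examples, n_classes) array;
    # deterministic ones hand us a 1-d vector of labels
    predictions = np.asarray(predictions)
    if predictions.ndim == 2:
        predictions = np.argmax(predictions, axis=1)  # collapse to labels
    return np.mean(predictions == np.asarray(truths))  # e.g. accuracy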

monte-flora commented 5 years ago

The reason I raise the issue is that when evaluating the AUC (using sklearn.metrics.roc_auc_score), an error was raised because the predictions need to be a 1-d vector (e.g., just the probability of the positive class).
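
For example (toy binary data, just to reproduce the shape requirement):

import numpy as np
from sklearn.metrics import roc_auc_score

truths = np.random.randint(0, 2, 100)    # binary labels
probabilities = np.random.rand(100, 2)   # shaped like a predict_proba output

print(roc_auc_score(truths, probabilities[:, 1]))  # works: 1-d positive-class column
# roc_auc_score(truths, probabilities)             # raises ValueError: expects 1-d scores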

gelijergensen commented 5 years ago

I see. And are you wanting to use it with a model which has a predict_proba method, or just a predict method? In the latter case, you can just use the model the other way (i.e. the "deterministic" way), because it'll call predict under the hood and then pass that 1-d vector along to the AUC scoring function. That's why calling it "deterministic" vs. "probabilistic" was probably a bit of a misnomer on my part: the model itself can produce 2-class probabilistic outputs and still technically be a "deterministic" model as far as the wrapper methods here are concerned.
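
In other words, the "deterministic" path boils down to this data flow (a sketch with a toy model, not the actual wrapper internals):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)
model = LogisticRegression().fit(X, y)

# predict yields a 1-d vector, which the scoring function consumes directly
print(roc_auc_score(y, model.predict(X)))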

If your model is using a predict_proba method, then we'll need to make a little wrapper around sklearn.metrics.roc_auc_score which grabs only the column that you care about.

monte-flora commented 5 years ago

Sounds good. The primary issue I was encountering is that predictions from the predict method produce an erroneously low AUC compared to predictions from predict_proba(...)[:, 1].

gelijergensen commented 5 years ago

Oh, hmm... that is weird. In any case, you could try using this wrapper function around sklearn.metrics.roc_auc_score:

def grab_single_class_probabilities(f):
    # Wrap a scoring function so that it only ever sees the positive-class column
    def wrapped_function(truths, predictions):
        # If your truths are already a 1-d label vector, drop the truths[:, 1] slice
        return f(truths[:, 1], predictions[:, 1])
    return wrapped_function

And then you can simply write grab_single_class_probabilities(roc_auc_score) instead of roc_auc_score. It's maybe not the cleanest (and you may not need to modify the truths as well as the predictions in the internal call), but I think it should work and would let you see whether the AUC is still oddly low.
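
Usage would then look something like this (toy one-hot truths and random probabilities, just to sanity-check the wrapper):

import numpy as np
from sklearn.metrics import roc_auc_score

truths = np.zeros((100, 2))
truths[np.arange(100), np.random.randint(0, 2, 100)] = 1  # one-hot labels
predictions = np.random.rand(100, 2)                      # fake predict_proba output

scorer = grab_single_class_probabilities(roc_auc_score)
print(scorer(truths, predictions))  # AUC computed on the positive-class column only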