gelijergensen / PermutationImportance

Python package for computing the importance of variables in a model through permutation selection

Determining if data is probabilistic #82

Closed · monte-flora closed this issue 4 years ago

monte-flora commented 5 years ago

On line 82 in `permutation_importance.py`, the code attempts to determine whether the scoring outputs are probabilistic: `if len(scoring_data[1].shape) > 1 and scoring_data[1].shape[1] > 1:`. However, I was confused by the logic. If the scoring outputs are a 1-d vector (e.g., `[0, 1, 1, 0, 0, 0, 1]`), then `len(scoring_data[1].shape) == 1`, so the check can never pass. A possible fix may be `if len(np.unique(scoring_data[1])) == 2:`, which instead asks whether the output values are binary.
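A minimal standalone sketch of the two checks being compared (not the package's actual code):

```python
import numpy as np

# A 1-d vector of binary targets, as described above
outputs = np.array([0, 1, 1, 0, 0, 0, 1])

# Existing check: data counts as probabilistic only if it is 2-d with >1 column
print(len(outputs.shape) > 1 and outputs.shape[1] > 1)  # False: a 1-d vector never passes

# Proposed alternative: ask whether the output values are binary
print(len(np.unique(outputs)) == 2)  # True
```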

gelijergensen commented 5 years ago

Probabilistic outputs look something like this: `[[0.1, 0.8, 0.1], [0.2, 0.8, 0.0]]`, where each row is a normalized array of class probabilities. The check above first verifies that we have a 2-d array and then eliminates arrays of the form `[[0.2], [0.8]]`, which contain only one item per row and therefore implicitly mean `[[0.2, 0.8], [0.8, 0.2]]` or something similar.
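For concreteness, a sketch of how that check behaves on the two 2-d shapes just mentioned (arrays invented for illustration):

```python
import numpy as np

full = np.array([[0.1, 0.8, 0.1], [0.2, 0.8, 0.0]])  # one normalized row per sample
single_column = np.array([[0.2], [0.8]])             # implicitly [[0.2, 0.8], [0.8, 0.2]]

def looks_probabilistic(arr):
    # 2-d with more than one column
    return len(arr.shape) > 1 and arr.shape[1] > 1

print(looks_probabilistic(full))           # True
print(looks_probabilistic(single_column))  # False: only one item per row
```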

I did once consider using `len(np.unique(.)) == 2`, but I then realized that this would catch the faulty case where a genuinely probabilistic array happens to contain only two distinct values, e.g. `[x1, x2, ...]`, where each `xi` is `[0.0, 0.0, 1.0]` or `[0.0, 1.0, 0.0]`. So I think the check is correct as it is. Do you have a specific example which is doing the wrong thing? It's always possible I didn't think of a use case.
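A sketch of that faulty case, where a clearly probabilistic 2-d array contains only two distinct values:

```python
import numpy as np

# One-hot probability rows: probabilistic, yet only two unique values overall
outputs = np.array([[0.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

print(len(np.unique(outputs)) == 2)  # True, so this check would wrongly flag the data as binary
```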

monte-flora commented 5 years ago

Well, let's clarify. The documentation says `:param scoring_data: a 2-tuple (inputs, outputs) for scoring in the scoring_fn`. Are the outputs meant to be predictions or the target values?

gelijergensen commented 5 years ago

`outputs` is intended to be the target values.

monte-flora commented 5 years ago

In that case, the target values should just be a column vector (a 1-d array) and not a 2-d array, correct? E.g., `outputs = [0, 1, 0, 1, 0, 1, 1, ...]`. I raise this issue because the logic above did not recognize that I was dealing with a column vector of binary outputs and therefore assumed I was making deterministic predictions.

gelijergensen commented 5 years ago

In general, the outputs could be any possible shape; in my tests, I think I have everything from a `None` object to a 2-d array. However, for the most common use case you are right: the target values will be a 1-d array or something of that shape. Do your outputs look like one of these: `outputs = [[0.2], [0.8], [0.1], [0.3]]` or `outputs = [0.2, 0.8, 0.1, 0.3]`?

monte-flora commented 5 years ago

The second case. Just a list of binary values.

gelijergensen commented 5 years ago

Okay, I get what's going on. As we are discussing in the other issue (#83), "probabilistic" here is a bit of a misnomer: it means the data is of the form `[row1, row2, ...]`, where each `row = [prob class a, prob class b, prob class c, ...]`. The reason for this is the difference between the `predict` and `predict_proba` calls of the underlying sklearn models, which output different things. So while your data is actually probabilistic, it gets fed through my wrapper methods as though it were deterministic. The final `evaluation_fn` then receives it in exactly that form and does something with it. If the scoring function is, for instance, `sklearn.metrics.roc_auc_score`, which expects exactly this 1-d list of probabilities, then my code will treat the data in the intermediate stages as "deterministic", but the end function knows it is really probabilistic and will handle it correctly. Does everything run if you use `score_trained_sklearn_model` instead of `score_trained_sklearn_model_with_probabilities`?
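A minimal sketch of the `predict` / `predict_proba` distinction (the model and data here are invented for illustration; only the sklearn calls are real):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict returns hard class labels -> what the package calls "deterministic"
labels = model.predict(X)       # e.g. [0, 1, 1, ...]

# predict_proba returns one normalized row per sample -> "probabilistic"
probs = model.predict_proba(X)  # e.g. [[0.9, 0.1], [0.2, 0.8], ...]

# roc_auc_score expects the 1-d probabilities of the positive class, so a
# "deterministic"-shaped 1-d array can still carry probabilistic meaning
print(roc_auc_score(y, probs[:, 1]))
```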

monte-flora commented 5 years ago

I should say that everything runs fine. My main problem was that the original AUC (no permutations) was coming out as 0.65 when I know it should be closer to 0.9. I assumed the problem was the `predict` method, which outputs binary predictions rather than probabilistic ones. So I manually changed the code as I suggested above and in the other issue (#83) and got the anticipated results. The low AUC was also likely associated with issue #80. Ultimately, I fixed the problems on my end but wanted to raise the issues in case the fixes might be helpful for other users.