8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

Question About Data Order #70

Open JBarsotti opened 2 years ago

JBarsotti commented 2 years ago

Hi!

Great module! I'm using it in a machine learning application right now, and I've noticed that the way that the dataframe is sorted affects the PPS scores. Should this be happening? Intuitively, I don't think so because the order of data shouldn't affect correlation, but I don't know.

Thanks,

John

8080labs commented 2 years ago

Thank you! Can you please share a reproducible example?

Sent from mobile

JBarsotti commented 2 years ago

Okay, sorry for the late reply. I had to find an example to share. :)

I can't share the dataset that I'm actually using because it is currently unpublished work that I don't think my boss would be happy about me sharing. :) On that dataset, the PPS scores vary a lot depending on the shuffled order of the rows of the dataframe. Here is an example with the Boston dataset instead. The scores don't vary nearly as much, but they do vary a little depending on the shuffle order (specified by the random_state value in the shuffle function).

import pandas as pd
import numpy as np
import ppscore
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

def sklearn_to_df(sklearn_dataset):
    # Convert a sklearn Bunch into a dataframe with a "target" column
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_boston = sklearn_to_df(datasets.load_boston())
# Change random_state to change the row order and watch the scores move
df_boston = shuffle(df_boston, random_state=23)

predictors = [i for i in df_boston.columns if i != "target"]
pps_train = df_boston[predictors].copy()
pps_train["target"] = df_boston["target"]

predictors_df = ppscore.predictors(pps_train, y="target", cross_validation=5, random_seed=5)
predictors_df_reduced = predictors_df[predictors_df.ppscore > 0.0]
sns.set(rc={"figure.figsize": (14.7, 8.27)})
sns.barplot(data=predictors_df_reduced, x="x", y="ppscore", dodge=False)
plt.xticks(rotation=90, fontsize=8);
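
A quick sketch to see the variation directly (reusing the objects above; the exact numbers will depend on your library versions):

# Recompute the predictors table for a few shuffle seeds and compare
# the top-ranked predictor and its score.
for seed in [1, 23, 42]:
    shuffled = shuffle(df_boston, random_state=seed)
    scores = ppscore.predictors(shuffled, y="target", cross_validation=5, random_seed=5)
    top = scores.iloc[0]
    print("seed", seed, "->", top["x"], round(top["ppscore"], 3))
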
JBarsotti commented 2 years ago

Hi! Sorry to bother again. Just wondering if you had a chance to look at this yet. If not, no worries! :)

fwetdb commented 2 years ago

Thank you for providing the example. I just ran the code. I added from sklearn import datasets to make it executable.

Based on the code that you provided, it is not 100% clear to me yet what you are referring to or how I should run the code to get the observation that you had.

However, here is what I think: the ppscore is based on cross-validation, and cross-validation depends on the underlying order of the dataset because this order determines which rows the model is trained on and which rows it is evaluated on. Depending on the dataset size and the data itself, this might have a substantial impact; in most cases, though, it should not have a strong one.
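
To illustrate the mechanism (a sketch; this assumes the folds are built in row order, as scikit-learn does by default when an integer cv is used):

import numpy as np
from sklearn.model_selection import KFold

# With unshuffled k-fold CV, fold membership is purely positional:
# the test folds are contiguous slices of the row order.
rows = np.arange(10)
for _, test_idx in KFold(n_splits=5).split(rows):
    print("test fold:", test_idx)
# test fold: [0 1]
# test fold: [2 3]
# ...
# So reordering the dataframe changes which observations end up in
# each train/test split, and with it the cross-validated score.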

That said, thinking about this, I think it would be useful to report the ppscore with confidence intervals or some other measure of the spread (variance in the loose, not the literal, sense) that can be observed when shuffling the dataset or adjusting the cross-validation.

I am wondering how strong the effects were for your dataset, and for columns of which data type in particular. I would expect this issue to have a significant effect for numeric data that is highly skewed or contains massive outliers, because that would affect the MAE, which is used as the evaluation metric in those cases.

What do you think?

JBarsotti commented 2 years ago

Thank you so much for this response! It makes a lot of sense! In the real dataset that I am using, all of the variables except for the target variable and one other variable are continuous. I have provided three plots below. The first two are from two separate runs of the code using two different data shuffle orders. The variable names are replaced with simple non-descriptive integers. (Note: I only included variables with a PPS score > 0; scores of exactly 0 are excluded.) As you can see, many more variables in the first plot are above a PPS score of 0 than in the second. The ordering is also different for some variables.

The third plot is one I created using the 95% confidence intervals you suggested. It's quite interesting. The error bars (the black portions of the bars) are large for some of the top variables. I ran 100 iterations, recalculating the PPS score of each variable in every iteration. (Also, sorry about the tiny x axis.) :(

One other thing is that the non-continuous variable is binary and does not have a large PPS score.

Example_1

Example_2

Error_Bars
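
Roughly, the computation behind the error-bar plot looked like this (a simplified sketch, using the Boston pps_train frame from my earlier snippet as a stand-in for my real data):

# Recompute the PPS of every predictor over many shuffle orders and
# summarise each feature with its mean score and a 95% interval.
n_runs = 100
runs = []
for seed in range(n_runs):
    shuffled = shuffle(pps_train, random_state=seed)
    scores = ppscore.predictors(shuffled, y="target", cross_validation=5, random_seed=5)
    runs.append(scores.set_index("x")["ppscore"])

all_runs = pd.concat(runs, axis=1)  # rows = features, columns = runs
summary = pd.DataFrame({
    "mean": all_runs.mean(axis=1),
    "lower": all_runs.quantile(0.025, axis=1),
    "upper": all_runs.quantile(0.975, axis=1),
}).sort_values("mean", ascending=False)

fig, ax = plt.subplots(figsize=(14.7, 8.27))
ax.bar(summary.index, summary["mean"],
       yerr=[summary["mean"] - summary["lower"], summary["upper"] - summary["mean"]])
ax.set_ylabel("ppscore (mean over 100 shuffles, 95% interval)")
plt.xticks(rotation=90, fontsize=8)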

fwetdb commented 2 years ago

Okay, thanks for sharing. To me, the confidence interval seems to span roughly ±20% around the middle value; that relative width seems fairly constant, which results in larger absolute intervals when the score itself is higher.

I would be curious to see a histogram, and a histogram faceted by the target variable, for 1 to 3 of the continuous variables. I suspect the shapes of those plots might tell us something about the variance in the ppscore. Also, how many observations do you have in total?
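
(For example, something along these lines with seaborn, reusing the imports from the snippet above; df and the feature names are placeholders for your data:)

# Overall histogram plus the same histogram split by the target class.
for feature in ["feat_1", "feat_2", "feat_3"]:    # hypothetical column names
    sns.histplot(data=df, x=feature)               # overall distribution
    plt.show()
    sns.displot(data=df, x=feature, col="target")  # one panel per target class
    plt.show()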

JBarsotti commented 2 years ago

Thank you so much for the reply!

There are 180 observations but 194 predictor variables! I had been wanting to use the PPS as a method of variable selection (i.e., I would select the variables with PPS scores greater than 0 and then run a machine learning algorithm; a rough sketch of that step is below the histograms), but maybe that is not a good way to use the PPS.

Anyway, here are the histograms for the top 3 features by PPS score. That is, these are the top 3 features averaged across 100 runs of the algorithm. I should also mention that the ratio of positive (TRUE) to negative (FALSE) classes is about 2 to 1.

Highest_PPS

Second_Highest_PPS

Third_Highest_PPS
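
For completeness, the selection step I had in mind is roughly this (a sketch building on the all_runs frame from my earlier error-bar snippet; the downstream model is left open):

# Keep only the features whose PPS, averaged over the shuffle runs,
# stays above zero, and hand those to a downstream model.
mean_scores = all_runs.mean(axis=1)
selected = mean_scores[mean_scores > 0].index.tolist()
X_selected = pps_train[selected]
y = pps_train["target"]
# X_selected and y would then go into whatever ML model comes next.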

fwetdb commented 2 years ago

Thank you for sharing the underlying data. Given the low number of observations and the very limited predictive power (it was hard for me to see anything in the graphs), I am not surprised by your findings.

What do you think?

JBarsotti commented 2 years ago

Yes, that makes sense! Would it be fair to say that the PPS doesn't work as well with lots of variables that have only low correlation with the outcome?

fwetdb commented 2 years ago

That's an interesting point. From my point of view, the PPS "truthfully" represented what's going on in the data. I especially like the addition of reporting its variance, e.g. via confidence intervals; that is definitely a nice best practice. Thus I would say that it "does work well" (in accordance with my expectations).

However, it seems to me that it does not (or did not) work well relative to your expectations. So I am wondering: which expectation or hope was not met? If you expect it to yield a single value that never changes, even on small sample sizes, then no, it does not deliver that (right now). But then the question is: what alternative does? And what is that alternative better or worse at?

JBarsotti commented 2 years ago

Please don't take my comment as a knock against it! I think it is really cool, and I totally agree with you that reporting confidence intervals is a great idea! It works really well for me: the results I see using it line up very well with the feature importances I get from something like Shapley values or even just predictive gain. I was surprised that it did not provide a consistent ranking of feature importance when rows were shuffled, but I assume this is because of high variance across cross-validation folds (something I also see with machine learning methods). I think it is a great tool and plan on continuing to use it in the future (while of course crediting you with its creation). 😄

fwetdb commented 2 years ago

Thank you for your kind words; I don't take it as a knock against it. I am really just curious about your expectations, and I like how this conversation brought up the use of confidence intervals and a better understanding of the impact of shuffling!

Also, I like how this surfaces the (implicit) expectations around tools like PPS, Shapley values, etc., because those might inform the next generation of such tools :)