8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

Data preprocessing and information leakage #48

Open HectorBarrio opened 3 years ago

HectorBarrio commented 3 years ago

Hello, before anything, thanks for the package; it is very useful, and the overall approach is innovative and saves a lot of work. I have a question regarding the "state" of the data that the PPS analysis runs on: it seems (I may be mistaken) that any transformation applied to the data beforehand (standardization, for example) will leak a large amount of information into the k-fold cross-validation. Is that correct? The module could use sklearn's pipelines and standard transforms to possibly increase the information generated. Would this be of value to the module?
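
To make the concern concrete, here is a minimal sketch (plain sklearn, not ppscore code) contrasting scaling the full dataset before cross-validation with scaling inside each fold via a pipeline:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Leaky: the scaler is fit on all rows, so statistics from the validation
# folds leak into the training data of every split.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(DecisionTreeRegressor(random_state=0), X_scaled, y, cv=4)

# Leak-free: the pipeline refits the scaler on the training portion of each fold only.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('tree', DecisionTreeRegressor(random_state=0))])
clean_scores = cross_val_score(pipe, X, y, cv=4)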

FlorianWetschoreck commented 3 years ago

Hi Hector, thank you for reaching out and for sharing your suggestions. I agree that transformations to the data can lead to data leakage. What is your proposal for adding sklearn pipelining and standard transforms to ppscore?

HectorBarrio commented 3 years ago

Let me try over the week to replace the models (regressor/classifier) with a pipeline that includes one standardization step. If this works, it can be exposed as a kwarg in predictors.

FlorianWetschoreck commented 3 years ago

I would like to protect your time, so before you start implementing the proposal, please provide a concept (aka some examples) for the API first. This way, we can first discuss the new API (aka user experience) and when we agree on a suitable API, we can talk about the implementation.

HectorBarrio commented 3 years ago

Yes, Florian, minimum changes if it works: it could be a keyword argument added to the predictors function that accepts a list of transformations, reaches the VALID_CALCULATIONS dictionaries, and replaces tree.DecisionTree*() with a pipeline that preprocesses using the input list of transformations, following this:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline

FlorianWetschoreck commented 3 years ago

I think I got it - can you still please give one detailed example with the actual syntax? I would love to have a look at what the full code would look like.

HectorBarrio commented 3 years ago

As an example only, in the calculation module:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Change the models to:

VALID_CALCULATIONS = {
    "regression": {
        "type": "regression",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "mean absolute error",
        "metric_key": "neg_mean_absolute_error",
        "model": Pipeline([('scaler', StandardScaler()),
                           ('tree', tree.DecisionTreeRegressor())]),
        "score_normalizer": _mae_normalizer,
    },
    "classification": {
        "type": "classification",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "weighted F1",
        "metric_key": "f1_weighted",
        "model": Pipeline([('scaler', StandardScaler()),
                           ('tree', tree.DecisionTreeClassifier())]),
        "score_normalizer": _f1_normalizer,
    },
}

This produces slightly different results for some data sets. The idea is to enable the "predictors" function to replace the model keys with a constructed pipeline, whose constructor is a little awkward since each step is a (name, transformer) tuple. The pipeline should take care of keeping the cross-validation scores free of leakage.

A call to predictors would look like:

transformers = [StandardScaler(), MinMaxScaler()]
predictors(df, 'column', transformers=transformers)

Here predictors (or another function) would have to build the pipeline's step list.
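
For illustration, a hypothetical helper could turn that list into a pipeline ending in the existing estimator (the name _build_model and the step-naming scheme are assumptions, not ppscore code):

from sklearn.pipeline import Pipeline

def _build_model(transformers, estimator):
    # Hypothetical helper: name each transformer step after its class and
    # append the existing ppscore estimator as the final step.
    steps = [(t.__class__.__name__.lower(), t) for t in transformers]
    steps.append(('model', estimator))
    return Pipeline(steps)

# e.g. _build_model([StandardScaler(), MinMaxScaler()], tree.DecisionTreeRegressor())

Alternatively, make_pipeline (linked above) performs this step naming automatically.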

FlorianWetschoreck commented 3 years ago

Hi Hector, thank you for the example, and I like the transformers API.

When I thought about this proposal, I was unsure which problem it should solve exactly. What is the scenario that the user is in, and why does the user use ppscore in that scenario? When did you last have this scenario yourself? How did you solve it then? Maybe you can explain this a little more - that would help my understanding.

HectorBarrio commented 3 years ago

Hello Florian, the use case is feature data that may exhibit outliers, skewed distributions, or other anomalies that can be improved by transformation instead of dropping the offenders. In this specific case I was looking for the best predictors among thousands of time series with several anomalies: I transformed them and then ran PPS, contaminating the internal cross-validation. I manually modified the cv that PPS uses to a time-series split and pipelined the data. Users may also want to min-max scale the data, or perform more complex transformations that they could pipeline if they are looking for quick comparisons. There were changes in the PPS score ranking with and without the transformations that may be significant.

As a side note, the cv object could also be exposed as a kwarg in the predictors functions to accept other splits; stratified k-fold comes to mind for very unbalanced datasets.

These are the two operations I had to perform manually in this case; exposing the transformations and the cv object as kwargs would automate them and make PPS more flexible.
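
For illustration, a call combining both proposed kwargs might look like this (both transformers and cv are hypothetical parameters, not part of the current ppscore API):

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Time-series use case: scale inside the folds and split chronologically.
predictors(df, 'target', transformers=[StandardScaler()], cv=TimeSeriesSplit(n_splits=4))

# Very unbalanced classification target: keep class proportions per fold.
predictors(df, 'target', cv=StratifiedKFold(n_splits=4))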

This enables quick checks; here PPS_standard is ppscore with the pipeline added:

import PPS as pps
import PPS_standard as pps_s
import pandas as pd
import numpy as np
import sklearn.datasets as ds

diabetes = ds.load_diabetes()
df = pd.DataFrame(data=np.c_[diabetes['data'], diabetes['target']],
                  columns=diabetes['feature_names'] + ['target'])
print(pps_s.predictors(df, y='target')[['x', 'ppscore']].head())
print(pps.predictors(df, y='target')[['x', 'ppscore']].head())

FlorianWetschoreck commented 3 years ago

Thank you for the explanation. Wouldn't it make more sense, then, to just pipe the cross-validation object into ppscore? Because in the end you are concerned about an invalid cross-validation.

Did you generate a cross-validation object at the end of your pipeline?

HectorBarrio commented 3 years ago

Hello Florian, an sklearn pipeline requires the last element to be the estimator, which in PPS is the automatically chosen regressor or classifier, so I have not found any way to feed it in other than overwriting the whole model with a pipeline that has the original estimator as its last element.

The whole pipeline could be an input to PPS; the user would have to decide between regression and classification in this case, or the logic of _determine_case_and_prepare_df would have to be extended so that it selects among multiple user-supplied models that are either classifiers or regressors. On the other hand, this would allow comparing PPS across multiple different models, not only trees.
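
If the user supplied a full pipeline, the selection logic could lean on sklearn's estimator-type helpers to distinguish the cases (a sketch under that assumption, not actual ppscore code):

from sklearn.base import is_classifier, is_regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A user-supplied pipeline; sklearn inspects the final step to decide its type.
user_model = Pipeline([('scaler', StandardScaler()),
                       ('model', RandomForestRegressor())])

assert is_regressor(user_model)
assert not is_classifier(user_model)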

FlorianWetschoreck commented 3 years ago

Hi Hector, I wish you a good start to the new year and sorry for the late reply - I have been on vacation.

Thank you for clarifying that the model is the last step for the cross-validation object and thus it is not possible to pass the full cv object.

If you want, you can go ahead and open a PR.

HectorBarrio commented 3 years ago

Happy New Year Florian. I will open the PR and propose the changes.

HectorBarrio commented 3 years ago

Hello Florian, the changes I made require the model (the tree regressor or classifier) within VALID_CALCULATIONS to be re-initialized every time the API is called, in order to include the pipeline object.

This adds no noticeable computational cost, but it cannot pass this test:

At line 156 of the tests:

assert pps.score(df, "x", "y", random_seed=1) == pps.score(
    df, "x", "y", random_seed=1
)

The model object at the 'model' entry of the dictionary is a different instance of a model with the same parameters. The contents of every other entry of the dict are equal, but the model instances are not, so the assertion fails. Just as a suggestion, this test (and subsequent result comparisons) could be modified to compare the dicts excluding the 'model' entry.
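
For illustration, the suggested comparison could use a small test helper (the name _without_model is hypothetical):

def _without_model(result):
    # Drop the non-comparable estimator instance before comparing results.
    return {k: v for k, v in result.items() if k != 'model'}

assert _without_model(pps.score(df, "x", "y", random_seed=1)) == _without_model(
    pps.score(df, "x", "y", random_seed=1)
)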

FlorianWetschoreck commented 3 years ago

Thank you for the heads-up. We can easily adjust that test.