PPS is not useful for data with high noise to signal ratio

ag-tcm commented 4 years ago

Note: "Issue" is not the correct tag for this as it is more of a comment, but anyone working with noisy data should consider it.

I'm interest in using PPS for feature selection. I work with data that has a high noise to signal ratio and the PPS score is consistently 0 despite changes to parameters such as sample size and number of cross validation folds. One can easily reproduce this result by changing the "error" term in the example from the "Getting started" section to be uniform -5 to 5 instead of -0.5 to 0.5 and leaving the "x" value as uniform -2 to 2. Does anyone have a similar experience or any insight on the usefulness of PPS for this type of data?

I think a more constructive approach to identifying when a relationship exists between data would be to not RIP correlation. Despite some downsides, it still has its place. Instead, please consider the benefits of using multiple scores that measure relatedness between data when working on your data science projects.

While Pearson correlation measures linear relationships between data, their are several other correlation measures that do not make that assumption and should be examined before abandoning sound mathematical/statistical methods. Spearman correlation measures rank correlation (i.e., does not measure linear relationship) and Kendall's tau (and several variants of it such as Goodman and Kruskal's gamma) measures ordinal association between data with a non-parametric construction. It would be interesting to see how these other correlation measures stack up against PPS in the canonical "0 correlation" scenarios.

Also, on a slightly different topic: One inherent benefit of PPS that its authors have not taken advantage of is its ability to easily be extended to measure predictive power of combinations of features (i.e., interaction terms). The decision tree model used under the hood already supports multiple input features so why not allow "x" to be "x's"? Traditional correlation measures (that I mention above) require the user to assume the form of the interaction between features (is it x1 * x2? x1 / x2? etc.) in order to test the relationship while an extended PPS would not require that assumption. I think this is a very powerful use case for PPS despite its challenge of being visualized in a traditional correlation matrix format.

Sorry for the rant. I do not mean to disparage PPS and the work the authors have done. I think it does have its merits as a practical solution to real world data science problems that we are all facing in our work and with some improvements it can be even more useful.

FlorianWetschoreck commented 4 years ago

Thank you for your message and I think it is a great problem to be discussed which might result in some better understanding of which tools to use in those scenarios.

We agree, that the PPS has a low score in high noise scenarios. Also, the PPS still has troubles in datasets with a few strong numeric outliers because those can distort the "total error mass". And there definitely needs to be special solution in those cases. In addition, the PPS is not stable across linear relationships with different slopes. Those issues still need to be addressed and we already started thinking about those.

Given the high noise scenarios, I would be interested what your real-life scenario is and what you do instead? If the noise is so strong that a cross-validated decision tree cannot outperform the naive predictor at all, I am wondering if there actually is enough signal in your data. And if so, which method is then safe to detect this without being biased by the hopes of the user? In this issue (https://github.com/8080labs/ppscore/issues/28) there are some great examples where the PPS is (near) 0 but classical correlation measures can find some signal and a human user might also find a signal if the signal makes sense BUT the data alone might just not be stable enough (given the cross-validation). The classical measures are finding a relationship but they are actively looking for this relationship and they do not perform cross-validation.

ag-tcm commented 4 years ago

In my case, the high noise data is from financial markets. Analyzing this data with Pearson correlation, Spearman correlation and Kendall tau all give values around 0.1 while PPS gives 0 for the particular dataset I've been working with recently. An interesting idea would be to generate a data set at random that has these statistical properties (i.e., correlation around 0.1) and then test different models applying the PPS methodology to identify those that give more reasonable results. Anecdotally, I've modified the PPS code to use KNN instead of the DecisionTreeRegressor and have found that it gives non-zero PPS scores for my noisy data. I'm planning to try SVR with RBF as well. Perhaps a feature of PPS could be to score multiple values under the hood and then report results for each of them.

My data is probably most similar to the third from the left here.

By the way, what do you think about extending PPS to allow for analyzing multiple x variables at one time against a single y variable?

FlorianWetschoreck commented 4 years ago

If your data comes from financial markets, I would be very skeptical if I can trust non-crossvalidated (backtested) measures like correlation. In particular when checking many variables because eventually you will find a relationship by chance.

I like the approach that you used the KNN as a different model under the hood - that is the intended use of PPS as a framework. And I also like the idea of fitting multiple scores under the hood in the future.

Regarding extending PPS to allow for analyzing multiple x variables at one time, I agree that this makes sense. And the score also generalizes to this scenario. So far, we have been hesitant to do this because a lot of the communication was focused around the bivariate use case. Also, what is the benefit for you of using the PPS with multiple x variables instead of fitting your own sklearn model? the default feature preprocessing, the normalization of the score with the naive model, the easier API, something else?

ag-tcm commented 4 years ago

The use case for using multiple x variables is for preprocessing (and taking advantage of the relative score vs. the naive model). Because we are dealing with a noisy problem in general we want to be careful when adding additional x variables to the master model.

For example we might have a variable x1 which we know has some predictive power for y. Then a new variable x2 is introduced which has no relationship to x1. We'd like to test if x2 can help in our overall model. So we look at various measures of relatedness between x2 and y (using correlation, MIC, PPS). We find that x2 on its own does not seem to predict y. However in our master model (for ex. some tuned ML model) when we use x1 and x2 to predict y it turns out the master model has better predictive power than when only x1 is used!

So it seems the interaction between x1 and x2 uncovers better predictive power. We cannot easily determine the form of the interaction term since our "black box" master model doesn't easily give us that answer. So it would be a difficult exercise to use traditional correlation measures guessing the form of the interaction to identify a predictive setup. Now imagine we have x2 through xn where n might be 100. In this case it would be nice to figure out which of these variables can pair with x1 to improve the model. Using PPS as a preprocessing step to establish which of these variables might be beneficial to include for the master model as a nice preprocessing step. And the best part is that knowledge about the form of the interaction term is not necessary!

Do you think PPS can be used in this way? Is it fair to compare PPS([x1, x2], y) to PPS(x1, y) and say that if PPS([x1, x2], y) > PPS(x1, y) that x2 improves the predictive power?

FlorianWetschoreck commented 4 years ago

Thank you for describing the use case in such detail. And yes, we can say that when PPS([x1, x2], y) > PPS(x1, y) that x2 improves the predictive power and we can even tell by how much when comparing the diff to PPS(x1, y).

So far I am a little bit unsure if we should add such a function because we do not want to spread bad practices. In reality, when trying to perform feature selection for your model, you should perform the feature selection calculations with the model that you actually want to use and using the preprocessing that makes the most sense. Currently, ppscore gives you an easy high-level API with default preprocessings for a Decision Tree. However, in reality you often might want to use another model or other preprocessing.

Of course, we could add support for a wide range of models and maybe even preprocessing, but I am not sure if this effort is worthwhile when users can just do those things with sklearn anyway. Also, it feels to me that this should then be a different project because that was not the original intent / scope of the ppscore.

An argument for adding such a function is that the PPS framework does generalize beyond the bivariate use case given the limitations stated above. And there are exploratory use cases where such a function is valuable. The problem of nudging users to perform wrong feature selection could also be addressed.

Anyway, thank you for starting the discussion

ag-tcm commented 4 years ago

Your points are well taken. Thanks for the productive dialogue and I look forward to the future improvements to PPS.

8080labs / ppscore

PPS is not useful for data with high noise to signal ratio #34