add option to select features and targets for computing matrix

8080labs / ppscore

Predictive Power Score (PPS) in Python

MIT License

add option to select features and targets for computing matrix #14

Closed SuryaThiru closed 4 years ago

SuryaThiru commented 4 years ago

With ppscore taking a considerable amount of computation time, I thought this would be a handy feature for people working on very wide datasets.

I also added some simple tests to test_matrix in test_calculation.py.

8080labs commented 4 years ago

Hi Surya, thank you for taking the time to create the PR and thinking about how to improve ppscore. Here are my thoughts about the enhanced proposal:

- Regarding ys: when calculating a matrix, the expectation is that every column acts as both a target and a feature. Hence, it does not make sense to limit the number of targets.
- When we don't use ys and just want to limit the number of columns via xs, we can do this directly on the dataframe via df[xs] and don't need to add a new method for this.

Therefore, I think that the value add is not big enough.

Do you have any other thoughts?

Thank you, Florian

SuryaThiru commented 4 years ago

Yes, I guess it is not a trivial case. It will be useful if we wish to identify strong predictors for one or more targets. While typical correlation matrices might not reveal some associations, ppscore does a good job of identifying them. So, for applications like feature selection, this might be useful because we avoid computing the predictive power between the features themselves and avoid writing loops similar to the ones already in pps.matrix. I think more people will find themselves using ppscore for such applications.
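To make the feature-selection argument concrete, here is a minimal sketch of scoring only the requested (feature, target) pairs instead of every pair of columns. It is not ppscore code: `toy_score` is a hypothetical stand-in scorer (not the real PPS), `selected_scores` is an illustrative name, and the data is a plain dict of columns rather than a dataframe.

```python
from collections import Counter, defaultdict


def toy_score(data, x, y):
    """Hypothetical stand-in scorer, NOT the real PPS.

    Predicts y with the most common y value seen for each x value and
    returns the share of rows that this rule gets right.
    """
    groups = defaultdict(Counter)
    for x_value, y_value in zip(data[x], data[y]):
        groups[x_value][y_value] += 1
    hits = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return hits / len(data[y])


def selected_scores(data, features, targets, score_fn=toy_score):
    """Score only the requested (feature, target) pairs, skipping x == y."""
    return [
        {"x": x, "y": y, "score": score_fn(data, x, y)}
        for y in targets
        for x in features
        if x != y
    ]


data = {
    "a": [1, 1, 2, 2],
    "b": [0, 1, 0, 1],
    "c": [5, 5, 6, 6],
    "label": ["p", "p", "q", "q"],
}

# Three scores are computed here; a full matrix over four columns
# would need one score per ordered pair of columns.
rows = selected_scores(data, features=["a", "b", "c"], targets=["label"])
for row in rows:
    print(row)
```

With one target and three features, only three scores are computed, which is where the saving on very wide datasets would come from.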

8080labs commented 4 years ago

I agree. I think, in this case, we should rather add a pps.predictors(df, target) method. If the user wants to inspect the features for multiple targets, she can just call this method for each target column. This is also what has been requested in #13.

SuryaThiru commented 4 years ago

Fair enough. I can move the implementation to a pps.predictors(df, target) method.

SuryaThiru commented 4 years ago

I am thinking of making the method return a dataframe of scores with the feature names as the index, and a "score" column. It can have the sort option like you initially planned (predictors(df, y, task=None, sorted=True)). What do you think?

8080labs commented 4 years ago

That sounds good. I think it should behave similarly to the matrix: either return a df with the scores or return a list of the score dicts. Sorted should be a boolean that is True by default. Also, it should accept **kwargs to pass through to the single score method.

What do you think about it?
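The behaviour described above could be sketched roughly as follows. This is an illustration of the proposal, not the implementation that was eventually merged: `toy_score` and its `sample` keyword are hypothetical stand-ins for the single score method and its options, the data is a plain dict of columns, and only the list-of-dicts return form is shown (the dataframe form is omitted).

```python
def toy_score(data, x, y, sample=None):
    """Hypothetical stand-in for the single score method, NOT the real PPS.

    Returns the share of rows in which columns x and y hold equal values.
    `sample` is an illustrative option that limits the rows considered.
    """
    xs, ys = data[x], data[y]
    if sample is not None:
        xs, ys = xs[:sample], ys[:sample]
    return sum(a == b for a, b in zip(xs, ys)) / len(ys)


def predictors(data, y, sorted=True, score_fn=toy_score, **kwargs):
    """Score every column except y as a predictor of y.

    Extra keyword arguments are passed through to the single score function.
    """
    rows = [
        {"x": x, "y": y, "score": score_fn(data, x, y, **kwargs)}
        for x in data
        if x != y
    ]
    if sorted:
        # The `sorted` parameter shadows the built-in sorted(),
        # so the list is sorted in place with list.sort() instead.
        rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


data = {
    "f1": [1, 0, 1, 1],
    "f2": [1, 0, 0, 1],
    "target": [1, 0, 0, 1],
}

ranked = predictors(data, "target")
for row in ranked:
    print(row)
```

Inspecting several targets then needs no extra API: under the same assumptions, `{t: predictors(data, t) for t in ["f1", "target"]}` calls the function once per target column.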

SuryaThiru commented 4 years ago

Yep. That sounds good.

8080labs commented 4 years ago

Great, looking forward to your PR :)

tkrabel commented 4 years ago

@SuryaThiru I think our discussion moved to PR #17. Can we close this PR?