8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

add option to select features and targets for computing matrix #14

Closed SuryaThiru closed 4 years ago

SuryaThiru commented 4 years ago

With ppscore taking a considerable amount of computation time, I thought this would be a handy feature for people working on very wide datasets.

I also added some simple tests to test_matrix in test_calculation.py.

8080labs commented 4 years ago

Hi Surya, thank you for taking the time to create the PR and thinking about how to improve ppscore. Here are my thoughts about the enhanced proposal:

Therefore, I think that the value add is not big enough.

Do you have any other thoughts?

Thank you, Florian

SuryaThiru commented 4 years ago

Yes, I guess it is not a trivial case. It will be useful if we wish to identify strong predictors for one or more targets. Typical correlation matrices might not reveal some associations, but ppscore does a good job of identifying them. So for applications like feature selection, this would be useful: we avoid computing the predictive power between predictors, and we avoid writing loops similar to the ones already in pps.matrix. I think more people will find themselves using ppscore for such applications.
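To make the cost argument concrete, here is a minimal sketch of the pair enumeration a restricted matrix would do. The helper name `score_pairs` and its `features`/`targets` parameters are hypothetical and not part of ppscore; the sketch only lists which (feature, target) pairs would need a score and computes no scores itself.

```python
from itertools import product


def score_pairs(columns, features=None, targets=None):
    """Return the (feature, target) pairs a restricted matrix would score.

    Hypothetical helper for illustration; not part of ppscore.
    With both arguments left as None, every ordered pair of distinct
    columns is returned. Whether the real pps.matrix also scores a
    column against itself is not assumed here.
    """
    features = list(columns) if features is None else list(features)
    targets = list(columns) if targets is None else list(targets)
    # Skip pairs where a column would be scored against itself.
    return [(x, y) for x, y in product(features, targets) if x != y]


columns = [f"col_{i}" for i in range(100)]
full = score_pairs(columns)
restricted = score_pairs(columns, targets=["col_0", "col_1"])
print(len(full), len(restricted))  # prints "9900 198"
```

For a 100-column dataset with two targets of interest, that is 198 score computations instead of 9900, which is where the saving on wide datasets would come from.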

8080labs commented 4 years ago

I agree. I think, in this case, we should instead add a pps.predictors(df, target) method. If the user wants to inspect the features for multiple targets, she can just call this method for each target column. This is also what has been requested in #13.

SuryaThiru commented 4 years ago

Fair enough. I can move the implementation to a pps.predictors(df, target) method.

SuryaThiru commented 4 years ago

I am thinking of making the method return a dataframe of scores with the feature names as the index, and a "score" column. It can have the sort option like you initially planned (predictors(df, y, task=None, sorted=True)). What do you think?

8080labs commented 4 years ago

That sounds good. I think it should behave similarly to the matrix: either return a df with the scores or return a list of the score dicts. Sorted should be a boolean that is True by default. Also, it should accept **kwargs to pass through to the single score method.
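A rough sketch of that behaviour, covering only the list-of-dicts return variant. Several things here are assumptions made so the example has no dependencies: a plain dict of columns stands in for the DataFrame, the `score_fn` parameter and `toy_score` are hypothetical stand-ins for the single score method, and the `"ppscore"` key used for sorting is assumed. This is not the implementation that ended up in ppscore.

```python
def predictors(df, y, score_fn, sorted=True, **kwargs):
    """Score every column of `df` as a predictor of column `y`.

    `df` is a plain dict of column name -> list of values, standing in
    for a DataFrame. `score_fn(df, x, y, **kwargs)` is a stand-in for
    the single score method and must return a dict with a "ppscore"
    key (an assumed key name).
    """
    # Extra keyword arguments are passed through to the score function.
    scores = [score_fn(df, x, y, **kwargs) for x in df if x != y]
    if sorted:
        # Strongest predictors first. The parameter shadows the builtin
        # `sorted`, so list.sort is used instead.
        scores.sort(key=lambda s: s["ppscore"], reverse=True)
    return scores


def toy_score(df, x, y, scale=1.0):
    """Toy stand-in: share of rows where x equals y (NOT the real PPS)."""
    matches = sum(a == b for a, b in zip(df[x], df[y]))
    return {"x": x, "y": y, "ppscore": scale * matches / len(df[y])}


data = {
    "b": [0, 0, 0, 1],
    "a": [1, 0, 1, 1],
    "target": [1, 0, 1, 0],
}
result = predictors(data, "target", toy_score)
print([s["x"] for s in result])  # prints "['a', 'b']"
```

With `sorted=False` the scores come back in column order instead, and a call such as `predictors(data, "target", toy_score, scale=2.0)` shows the keyword argument reaching the score function.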

What do you think about it?

SuryaThiru commented 4 years ago

Yep. That sounds good.

8080labs commented 4 years ago

Great, looking forward to your PR :)

tkrabel commented 4 years ago

@SuryaThiru I think our discussion moved to PR #17. Can we close this PR?