8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License
1.12k stars 168 forks

ppscore drops rows with missing values: can be improved. #21

Open xiaodaigh opened 4 years ago

xiaodaigh commented 4 years ago

Can missing values be treated as a separate category? I sometimes see errors such as "after dropping missing no valid rows are there".

I see in the docs: "All rows which have a missing value in the feature or the target column are dropped".

This is not desirable, as missingness may itself be a predictive factor, which LightGBM and XGBoost can handle.
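
To illustrate the point about missingness being predictive, here is a minimal, stdlib-only sketch. It is not ppscore's implementation: ppscore fits a cross-validated decision tree, whereas this toy scores a categorical feature in-sample by predicting the per-category median of the target. It only borrows the PPS-style normalization for regression, `1 - MAE_model / MAE_naive`, with the target median as the naive baseline. The names `pps_like` and `MISSING` and the data are hypothetical.

```python
from statistics import median

MISSING = "__missing__"  # hypothetical sentinel label for missing feature values


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pps_like(feature, target, missing_as_category):
    """Toy, in-sample PPS-style score of a categorical feature for a numeric target."""
    if missing_as_category:
        rows = [(MISSING if x is None else x, y) for x, y in zip(feature, target)]
    else:
        # Mimic the documented behaviour: drop rows with a missing feature value.
        rows = [(x, y) for x, y in zip(feature, target) if x is not None]
    if not rows:
        raise ValueError("no valid rows left after dropping missing values")

    ys = [y for _, y in rows]
    # Naive baseline: always predict the overall median of the target.
    naive_mae = mae(ys, [median(ys)] * len(ys))
    if naive_mae == 0:
        return 0.0  # constant target: nothing to predict

    # "Model": predict the median target of each feature category.
    groups = {}
    for x, y in rows:
        groups.setdefault(x, []).append(y)
    group_median = {x: median(vals) for x, vals in groups.items()}
    model_mae = mae(ys, [group_median[x] for x, _ in rows])

    return max(0.0, 1 - model_mae / naive_mae)


# Hypothetical data: the target is high exactly when the feature is missing.
feature = ["a", "b", None, "a", None, "b", None, "a"]
target = [1.0, 1.2, 9.0, 0.9, 9.5, 1.1, 8.8, 1.0]

score_dropped = pps_like(feature, target, missing_as_category=False)
score_kept = pps_like(feature, target, missing_as_category=True)
```

On this made-up data, `score_kept` comes out higher than `score_dropped`, because the rows that carry most of the signal are exactly the ones that get dropped. If every feature value is missing, the dropping variant raises an error, while the category variant still returns a score.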

8080labs commented 4 years ago

Thank you for bringing this up and I agree that it is desirable to also handle missing values out of the box.

Do you have some suggestions on how to adjust the concept/implementation in order to handle the missing values?

xiaodaigh commented 4 years ago

Use an XGBoost- or LightGBM-like implementation.

8080labs commented 4 years ago

In general, it is of course possible to use another learner under the hood as part of the PPS framework. So, I am wondering what other implications this might have in terms of overfitting and calculation time. Also, it will be interesting to see some details about how the algorithms handle the missing values and what scores this leads to, especially when there are missing values in the feature and/or the target.
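
As a rough illustration of how tree-based boosters deal with a missing feature value: at each split, rows with a missing value are sent to whichever side gives the lower loss (XGBoost calls this a learned default direction). The sketch below applies that idea to a single regression split using only the standard library. It is a simplification, not the code of either library; the function name `best_split_with_default` and the data are hypothetical, and it assumes the target itself has no missing values.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_with_default(feature, target):
    """Find one threshold split on a numeric feature that may contain None.

    Rows with a missing feature value are tried on both sides of every
    candidate split; the side with the lower total loss wins.
    Returns (loss, threshold, missing_goes_left), or None if no feature
    value is present.
    """
    present = [(x, y) for x, y in zip(feature, target) if x is not None]
    missing_y = [y for x, y in zip(feature, target) if x is None]
    thresholds = sorted({x for x, _ in present})

    best = None
    for thr in thresholds:
        left = [y for x, y in present if x <= thr]
        right = [y for x, y in present if x > thr]
        for missing_goes_left in (True, False):
            l = left + missing_y if missing_goes_left else left
            r = right if missing_goes_left else right + missing_y
            loss = sse(l) + sse(r)
            if best is None or loss < best[0]:
                best = (loss, thr, missing_goes_left)
    return best


# Hypothetical data: rows with a missing feature behave like the high-valued rows.
feature = [1.0, 2.0, None, 3.0, None, 4.0]
target = [1.0, 1.1, 5.0, 5.2, 4.9, 5.1]

loss, threshold, missing_goes_left = best_split_with_default(feature, target)
```

Here the split lands at `threshold == 2.0` and the missing rows are routed to the right-hand side, next to the rows whose targets they resemble. No rows are dropped, which is the behaviour requested in this issue. Missing values in the target are a separate question that this sketch does not address.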

Do you want to spend some time yourself working on this issue or did you only want to request it?

xiaodaigh commented 4 years ago

Truth is, we debated within our team whether to use ppscore but decided against it. We developed our own algorithm using LightGBM.

The main reason was its inability to handle missing values. I have no intention of working on it.

8080labs commented 4 years ago

Interesting, so you basically implemented the same concept/calculation of the PPS but used LightGBM?

xiaodaigh commented 4 years ago

In fact, we had implemented a sklearn random-forest-based one (this was before ppscore even existed, I think), and it was too slow and didn't handle missing values.

Then we researched alternatives and saw your Medium post, so we assessed ppscore as well. But we decided to implement our own.

FlorianWetschoreck commented 4 years ago

Ok, so it sounds like your primary goal was creating a model in order to create predictions, and your focus was not to create a score that you would use for further data exploration. Is that correct?

xiaodaigh commented 4 years ago

"primary goal was creating a model in order to create predictions."

True. But we also use feature importance to guide some suggestions.