Open xiaodaigh opened 4 years ago
Thank you for bringing this up and I agree that it is desirable to also handle missing values out of the box.
Do you have some suggestions on how to adjust the concept/implementation in order to handle the missing values?
Use XGBoost- or LightGBM-like implementations.
In general, it is of course possible to use another learner under the hood as part of the PPS framework. I am wondering what implications this might have for overfitting and calculation time. It would also be interesting to see some details about how the algorithms handle missing values and which scores this leads to, especially when there are missing values in the feature and/or the target.
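As a rough illustration of the question above about how such learners treat missing values: XGBoost and LightGBM are generally described as learning, at each split, which side missing values should go to. The sketch below is a toy single-split version of that idea in plain Python, not either library's actual code. The function names `sse` and `best_split_with_missing` and the sample data are hypothetical.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_with_missing(xs, ys):
    """Return (threshold, missing_goes_left, loss) for a one-split stump.

    Rows whose feature is None or NaN are tried on both sides of every
    candidate threshold; the side that gives the lower squared error wins.
    Returns None if fewer than two distinct feature values are present.
    """
    def is_missing(x):
        return x is None or (isinstance(x, float) and math.isnan(x))

    present = [(x, y) for x, y in zip(xs, ys) if not is_missing(x)]
    missing_y = [y for x, y in zip(xs, ys) if is_missing(x)]
    values = sorted({x for x, _ in present})
    best = None
    # Candidate thresholds are midpoints between adjacent observed values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in present if x <= threshold]
        right = [y for x, y in present if x > threshold]
        for missing_goes_left in (True, False):
            if missing_goes_left:
                loss = sse(left + missing_y) + sse(right)
            else:
                loss = sse(left) + sse(right + missing_y)
            if best is None or loss < best[2]:
                best = (threshold, missing_goes_left, loss)
    return best

# Hypothetical data: rows with a missing feature have high target values.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [10.0, 11.0, 30.0, 29.0, 31.0, 32.0]
threshold, missing_goes_left, loss = best_split_with_missing(xs, ys)
```

In this toy data the missing rows resemble the high-valued rows, so the stump routes them to the right-hand side instead of dropping them.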
Do you want to spend some time yourself working on this issue or did you only want to request it?
The truth is, we debated within our team whether to use ppscore but decided against it. We developed our own algorithm using LightGBM.
The main reason was its inability to handle missing values. We have no intention of working on it.
Interesting, so you basically implemented the same concept/calculation of the PPS but used LightGBM?
In fact, we had implemented a sklearn random-forest-based one (this was before ppscore even existed, I think), and it was too slow and didn't handle missing values.
Then we researched alternatives and saw your Medium post, so we assessed ppscore as well. But we decided to implement our own.
OK, so it sounds like your primary goal was creating a model in order to create predictions, and your focus was not on creating a score that you would use for further data exploration. Is that correct?
> primary goal was creating a model in order to create predictions.
True. But we also use feature importance to guide some suggestions.
Can missing values be treated as a separate category? I sometimes see errors such as "after dropping missing no valid rows are there".
I see in the docs: "All rows which have a missing value in the feature or the target column are dropped"
This is not desirable, as missingness may itself be a predictive factor, which LightGBM and XGBoost can handle.
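To make the point concrete, here is a minimal sketch of a PPS-style regression score (1 - MAE_model / MAE_naive, with a median baseline) that can either drop rows with a missing feature or treat missing as its own category. It is an in-sample toy without cross-validation, not ppscore's actual implementation. The `MISSING` sentinel, the `pps_like_score` function, and the data are hypothetical.

```python
from statistics import mean, median

MISSING = "<missing>"  # hypothetical sentinel label for absent feature values

def pps_like_score(feature, target, keep_missing=True):
    """In-sample PPS-style score for a categorical feature and numeric target.

    Predicts each row with the median target of its feature category and
    compares the MAE against a naive median-of-target baseline.
    Returns None when no rows are left to score.
    """
    rows = [(MISSING if f is None else f, t)
            for f, t in zip(feature, target)
            if t is not None and (keep_missing or f is not None)]
    if not rows:
        return None
    targets = [t for _, t in rows]
    naive = median(targets)
    mae_naive = mean(abs(t - naive) for t in targets)
    if mae_naive == 0:
        return 0.0  # constant target: nothing to predict
    # Group targets by feature category; missing is just another category.
    groups = {}
    for f, t in rows:
        groups.setdefault(f, []).append(t)
    group_median = {f: median(ts) for f, ts in groups.items()}
    mae_model = mean(abs(t - group_median[f]) for f, t in rows)
    return max(0.0, 1 - mae_model / mae_naive)

# Hypothetical data: a missing feature value coincides with a high target.
feature = [None, None, None, "a", "a", "a"]
target = [100.0, 102.0, 101.0, 10.0, 12.0, 11.0]
score_kept = pps_like_score(feature, target, keep_missing=True)
score_dropped = pps_like_score(feature, target, keep_missing=False)
```

With this data, keeping missing as a category gives a score close to 1, while dropping those rows gives 0 because the only remaining category carries no information. If every feature value is missing and rows are dropped, the function returns None, which mirrors the "no valid rows" situation described above.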