feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

feat: Gradient Boosting Tree predicted leaf as feature #293

Open TremaMiguel opened 3 years ago

TremaMiguel commented 3 years ago

Is your feature request related to a problem? Please describe.

LightGBM has the option to return the index of the predicted decision tree leaf for every sample. From the documentation:

> If pred_leaf=True, the predicted leaf of every tree for each sample.

Reference

Describe the solution you'd like

A DecisionTreeLeafEncoder transformer that returns the output of lightgbm's predict method (with pred_leaf=True) as new features.
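
To make the request concrete, here is a minimal sketch (not an actual Feature-engine API; the column names `leaf_0`, `leaf_1`, ... are made up) of how the pred_leaf output could be appended as new features:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_classification

# toy data
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(X.shape[1])])

# fit a small gradient boosting model
model = lgb.LGBMClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# with pred_leaf=True, predict returns the leaf index of every tree
# for each sample, with shape (n_samples, n_trees)
leaves = model.predict(X, pred_leaf=True)

# append the leaf indices as new columns
leaf_df = pd.DataFrame(
    leaves,
    columns=[f"leaf_{i}" for i in range(leaves.shape[1])],
    index=X.index,
)
X_new = pd.concat([X, leaf_df], axis=1)
```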

solegalli commented 3 years ago

I've heard of this method before. How widespread is its use (other than in Kaggle)?

Things to consider:

Thoughts welcome!

TremaMiguel commented 3 years ago

1. The idea I get from this method is to find relations or interactions between the features: a sample in each leaf would be represented by different characteristics of the variables. And each sample might be represented differently in each tree, so this helps find interactions between groups of features.

2.

> would users be able to understand what the importance of these features tells them based on the model?

This is a good point: for individual decision trees we can get an idea of why each sample was assigned to a leaf, but getting the trace for every tree would be hard to interpret.

I had no idea Feature-engine aimed at semi-interpretable methods. I see this more as an experimental feature that users can choose to try or not.

3.

> I don't necessarily want the package lightgbm as a dependency

As far as I know, only the lightgbm implementation can return the predicted leaf. So lightgbm could be an optional dependency, installed for example with pip install feature-engine[extras] or something like that.
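
A minimal sketch of how such an optional-import guard could look (hypothetical code, not an actual Feature-engine implementation; only the extras name follows the feature-engine[extras] idea above):

```python
# Hypothetical optional-import guard; not actual Feature-engine code.
try:
    import lightgbm as lgb
except ImportError:
    lgb = None


class DecisionTreeLeafEncoder:
    """Sketch: add the predicted leaf index of each tree as new features."""

    def __init__(self, **lgbm_params):
        if lgb is None:
            raise ImportError(
                "lightgbm is required for DecisionTreeLeafEncoder. "
                "Install it with `pip install feature-engine[extras]`."
            )
        self.estimator = lgb.LGBMClassifier(**lgbm_params)
```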

solegalli commented 3 years ago

Thank you!

In its inception, Feature-engine was meant to include methods that you would actually use when creating models for real-life applications. My experience, from finance and insurance, is that you need to be able to explain what the model is outputting, and the users of the models, for example the fraud investigators, would like to understand what the feature is telling them. That is why encoding methods like feature hashing or binary encoding (as in category-encoders) were off the table.

Having said this, I get the impression that users are asking for more alternative techniques, so we could consider whether to include them, but I would say at a later stage, after we give it some thought; maybe we do a user survey or something. I will add more on this in the roadmap.

I would keep this issue on hold for now and focus on other issues that are more of a priority.

Also, I would like to spend some time looking into whether something similar could be done with random forests or GBMs from sklearn instead of lightgbm.
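
For reference, a minimal sketch using sklearn only: both RandomForestClassifier and GradientBoostingClassifier expose apply(X), which returns the leaf that each sample lands in for every tree, and the indices can then be one-hot encoded (similar in spirit to sklearn's example on feature transformations with ensembles of trees):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# random forest: apply() returns leaf indices of shape (n_samples, n_estimators)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
rf_leaves = rf.apply(X)

# gradient boosting: apply() returns shape (n_samples, n_estimators, n_classes),
# so flatten the trailing dimensions
gbm = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
gbm_leaves = gbm.apply(X).reshape(X.shape[0], -1)

# one-hot encode the leaf indices so they can feed, e.g., a linear model
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(np.hstack([rf_leaves, gbm_leaves]))
```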

TremaMiguel commented 3 years ago

@solegalli, let's do a survey on LinkedIn asking about the interest in these methods.