feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

[FEATURE] Categorical Variable Concatenation #411

Open Pacman1984 opened 2 years ago

Pacman1984 commented 2 years ago

Expected Behavior

Concatenating categorical variables is a powerful feature engineering technique, often used in competitions. You could watch the 9 minutes of this video to understand the topic (I have linked the video at the relevant starting minute already): Winning Solution --> RecSys 2020 Tutorial: Feature Engineering for Recommender Systems.

Categorical Variable Concatenation is not implemented in scikit-learn or scikit-learn-contrib packages.

I have coded this feature in a separate repo, catcomb, and would implement this solution in feature_engine, if you agree.

Basically, what it does is concatenate all categorical columns with each other, based on some parameters you can choose.

image

It's a columns transformer where you can choose which columns to combine (columns), how many columns to concatenate at a time (level), and the maximum cardinality to consider (max_cardinality).

Example: `pipe = Pipeline([("catcomb", ColumnsConcatenation(columns='auto', level=2, max_cardinality=500))])`
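
For illustration, here is a minimal sketch in plain pandas of the kind of pairwise concatenation I mean (the helper below is only an illustration, not the actual catcomb implementation):

```python
# Hypothetical sketch of pairwise (level=2) concatenation of categorical
# columns with plain pandas; not the catcomb implementation itself.
from itertools import combinations

import pandas as pd


def concatenate_categoricals(df, columns=None, level=2, max_cardinality=500, sep="_"):
    """Create new columns by concatenating each combination of categorical columns."""
    if columns is None:
        # 'auto'-like behaviour: pick object / categorical columns.
        columns = df.select_dtypes(include=["object", "category"]).columns.tolist()

    # Skip highly cardinal variables to keep the feature space manageable.
    columns = [c for c in columns if df[c].nunique() <= max_cardinality]

    out = df.copy()
    for combo in combinations(columns, level):
        out[sep.join(combo)] = df[list(combo)].astype(str).agg(sep.join, axis=1)
    return out


df = pd.DataFrame({"city": ["NY", "SF"], "device": ["ios", "android"], "price": [10, 20]})
df_new = concatenate_categoricals(df, level=2)
# df_new now contains an extra column "city_device" with values "NY_ios" and "SF_android".
```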

I also posted this issue on scikit-lego and category_encoders, but maybe it's better implemented here. The maintainers of the two other packages were uncertain whether their repos are the right place for this feature, so I think feature_engine is pretty close to a perfect fit.

solegalli commented 2 years ago

Oh yes, I love well-crafted feature requests, with plenty of links and examples :)

Thank you so much for that.

I've seen this technique before. In fact, if I remember correctly, it was explained in a Coursera course called "How to Win a Data Science Competition: Learn from Top Kagglers". But I don't seem to be able to find the course any more.

I don't seem to find the class ColumnsConcatenation in the link to your repo, either.

That aside, there are a few things to consider for a class like this:

  1. with highly cardinal categorical variables, the feature space will explode, and it will potentially take a huge amount of time to compute all the new features.
  2. the test set, or future data (not used in fit), could have new categories.

How would we tackle the above issues?

I am happy to bring that transformer here. This transformer creates new features, so it should probably go in the "creation" module, and inside that module we should probably create a new module called "categorical" to differentiate it from the combination of numerical features.

It would be good to discuss a bit about how the class would tackle the above issues. Do you have some thoughts on that? I'd be keen to hear :)

Thanks for the suggestion.

Pacman1984 commented 2 years ago

Hi,

the code is in the __init__.py file. Link:

image

Regarding your considerations:

  1. I have already implemented a strategy to only consider categorical variables with a cardinality below a threshold (see the max_cardinality parameter).
  2. The transform function will not consider any column that was not already available in the fit function (already implemented), so columns that only appear in a test set would be ignored by the concatenation (see the rough sketch after this list). New categories in an already existing column would be handled afterwards by the next transformer, because the generated new columns must be transformed anyway, e.g. by a one-hot encoder, target encoder, etc. I think that numerical transformation should handle this, or am I wrong?
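
To make the two points concrete, a rough fit/transform sketch of these strategies could look like this (a simplified illustration with a hypothetical class name, not the actual catcomb code):

```python
# Rough sketch (an assumption, not the catcomb implementation) of how the two
# strategies could be wired into a scikit-learn style transformer.
from itertools import combinations

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnsConcatenationSketch(BaseEstimator, TransformerMixin):
    def __init__(self, level=2, max_cardinality=500, sep="_"):
        self.level = level
        self.max_cardinality = max_cardinality
        self.sep = sep

    def fit(self, X, y=None):
        # 1) keep only categorical columns whose cardinality is below the threshold
        candidates = X.select_dtypes(include=["object", "category"]).columns
        self.columns_ = [
            c for c in candidates if X[c].nunique() <= self.max_cardinality
        ]
        self.combinations_ = list(combinations(self.columns_, self.level))
        return self

    def transform(self, X):
        # 2) only use the column combinations learned in fit; combinations with
        #    columns that were not seen during fit are simply skipped.
        X = X.copy()
        for combo in self.combinations_:
            if not set(combo).issubset(X.columns):
                continue
            X[self.sep.join(combo)] = (
                X[list(combo)].astype(str).agg(self.sep.join, axis=1)
            )
        return X
```

New categories inside the concatenated columns would then be dealt with by whatever encoder follows in the pipeline.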

Thanks for your effort in maintaining the package.

glevv commented 2 years ago

Yes, it's a rerun of #84. If someone has an implementation of this, it should first be tested for OOM issues as well as speed, since there could be some problems.

solegalli commented 2 years ago

Sorry for the late response @Pacman1984, for some reason I only noticed your answer after @GLevV added his comment :/

I guess we could draft an implementation of this transformer and then run some speed tests?

How long does the transformation take as the number of variables to combine and/or the cardinality increases?

And from there, we can make some decisions on how to limit the functionality of the class, i.e., to certain low-cardinality values and/or a certain maximum number of variables to combine.

What do you think?
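
Something like the following rough sketch (synthetic data and a simple pairwise helper, purely as an illustrative assumption) could be a starting point for the speed test:

```python
# Rough timing sketch (assumptions only) to see how transform time scales with
# the number of variables to combine and with their cardinality.
import time
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def make_data(n_rows, n_cols, cardinality):
    # Synthetic categorical data with a controllable number of columns and categories.
    return pd.DataFrame(
        {f"cat_{i}": rng.integers(0, cardinality, n_rows).astype(str) for i in range(n_cols)}
    )


def concat_pairs(df, sep="_"):
    # Simple pairwise concatenation of all columns.
    out = df.copy()
    for a, b in combinations(df.columns, 2):
        out[f"{a}{sep}{b}"] = df[a] + sep + df[b]
    return out


for n_cols in (5, 10, 20):
    for cardinality in (10, 100, 1000):
        df = make_data(100_000, n_cols, cardinality)
        start = time.perf_counter()
        concat_pairs(df)
        print(f"cols={n_cols:2d} card={cardinality:4d} -> {time.perf_counter() - start:.2f}s")
```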

Pacman1984 commented 2 years ago

I can make a pull request next week.

solegalli commented 1 year ago

https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-pandas-dataframe
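
For reference, the approach discussed in that thread boils down to plain string concatenation of the columns, roughly:

```python
# Plain pandas string concatenation of two columns, as in the linked Stack Overflow thread.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF"], "device": ["ios", "android"]})
df["city_device"] = df["city"].astype(str) + "_" + df["device"].astype(str)
```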