Pacman1984 opened this issue 2 years ago
Oh yes, I love well crafted feature requests, with plenty of links and examples :)
Thank you so much for that.
I've seen this technique before. In fact, it was explained, if I remember correctly, in a Coursera course called "How to Win a Data Science Competition: Learn from Top Kagglers". But I don't seem to be able to find the course anymore.
I can't find the class ColumnsConcatenation in the link to your repo, either.
That aside, there are a few things to consider for a class like this:
How would we tackle the above issues?
I am happy to bring that transformer here. This transformer creates new features, so it should probably go in the "creation" module, and inside that module, we should probably create a new module called "categorical" to differentiate it from combining numerical features.
It would be good to discuss a bit about how the class would tackle the above issues. Do you have some thoughts on that? I'd be keen to hear :)
Thanks for the suggestion.
Hi,
The code is in the `__init__.py` file. Link:
Regarding your considerations:
Thanks for your effort in maintaining the package.
Yes, it's a rerun of #84. If someone implements this, it should first be tested for OOM issues as well as speed, since there could be some problems there.
Sorry for the late response @Pacman1984 for some reason I only noticed your answer after @GLevV added his comment :/
I guess we could draft an implementation of this transformer and then run some speed tests?
How long does the transformation take as the number of variables to combine and/or the cardinality increases?
And from there, we can make some decisions on how we limit the functionality of the class, i.e., to variables below a certain cardinality and/or a certain maximum number of variables to combine.
What do you think?
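To make the speed tests concrete before drafting anything, here is a rough back-of-the-envelope sketch of how fast the output grows. The helper names are my own illustration, not from any existing repo:

```python
from math import comb

def n_combined_features(n_columns: int, level: int) -> int:
    """Number of new columns created by combining `level` columns at a time."""
    return comb(n_columns, level)

def max_combined_cardinality(cardinalities: list[int], level: int) -> int:
    """Worst-case cardinality of a single concatenated feature:
    the product of the `level` largest input cardinalities."""
    top = sorted(cardinalities, reverse=True)[:level]
    result = 1
    for c in top:
        result *= c
    return result

# 20 categorical columns combined pairwise already yield 190 new features.
print(n_combined_features(20, 2))                    # 190
# Two columns of cardinality 500 can produce up to 250000 distinct values.
print(max_combined_cardinality([500, 500, 10], 2))   # 250000
```

This is why capping both the cardinality and the number of variables to combine seems necessary: the worst case is multiplicative in cardinality and combinatorial in the number of columns.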
I can make a pull request next week.
Expected Behavior
Concatenating categorical variables is a powerful feature engineering technique, often used in competitions. You could watch the 9 minutes of this video to understand the topic (I placed the video at the right starting minute already): Winning Solution --> RecSys 2020 Tutorial: Feature Engineering for Recommender Systems.
Categorical Variable Concatenation is not implemented in scikit-learn or scikit-learn-contrib packages.
I have coded this feature in a separate repo catcomb and would implement this solution in feature_engine, if you agree.
Basically, what it does is concatenate all categorical columns with each other, based on some parameters you can choose.
It's a columns transformer where you can choose:
columns: whether the categorical columns should be inferred automatically, or the column names should be given explicitly
level: the depth of the combinatorial expansion, i.e., how many columns are combined at a time
max_cardinality: the maximum cardinality of the original columns to include in the concatenation process, because concatenating very high-cardinality columns isn't that useful
Example
```python
pipe = Pipeline([("catcomb", ColumnsConcatenation(columns='auto', level=2, max_cardinality=500))])
```
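For discussion purposes, here is a minimal sketch of what such a transformer could look like. This is my own illustration, not the catcomb implementation; only the parameter names follow the example above, and everything else is an assumption:

```python
import itertools

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnsConcatenation(BaseEstimator, TransformerMixin):
    """Sketch: concatenate categorical columns pairwise, triple-wise, etc.,
    up to `level` columns at a time, skipping columns whose cardinality
    exceeds `max_cardinality`."""

    def __init__(self, columns="auto", level=2, max_cardinality=500, sep="_"):
        self.columns = columns
        self.level = level
        self.max_cardinality = max_cardinality
        self.sep = sep

    def fit(self, X, y=None):
        if self.columns == "auto":
            # Infer categorical columns from the dtypes.
            cols = X.select_dtypes(include=["object", "category"]).columns
        else:
            cols = self.columns
        # Keep only columns whose cardinality is low enough to be useful.
        self.columns_ = [c for c in cols if X[c].nunique() <= self.max_cardinality]
        return self

    def transform(self, X):
        X = X.copy()
        for r in range(2, self.level + 1):
            for combo in itertools.combinations(self.columns_, r):
                name = self.sep.join(combo)
                X[name] = X[list(combo)].astype(str).agg(self.sep.join, axis=1)
        return X

df = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
print(ColumnsConcatenation(level=2).fit_transform(df)["color_size"].tolist())
# ['red_S', 'blue_M']
```

A sketch like this would be a natural starting point for the speed tests discussed above, since `level` and `max_cardinality` are the two knobs that bound the output size.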
I also posted this issue on scikit-lego and category_encoders, but maybe it's better implemented here. The maintainers of the two other packages were uncertain whether their repos are the right ones for this feature. So I think feature_engine is pretty close to a perfect fit for it.