Sandy4321 opened this issue 4 years ago
You can do it like this:
df['comb_cat_feature'] = df['cat_feature_1'].astype(str).str.cat(df['cat_feature_2'].astype(str), sep='_')
If you want all possible combos, then it will be something like this:
from itertools import chain, combinations
list(chain.from_iterable(combinations(s, r) for r in range(2, len(s)+1)))  # where s is the list of cat columns
and then iterate over this list and use the str.cat method from pandas.
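Putting those two snippets together, a minimal sketch (the dataframe and column names are invented for illustration):

```python
from itertools import chain, combinations

import pandas as pd

df = pd.DataFrame({"gender": ["female", "male"], "colour": ["blue", "green"]})
cat_cols = ["gender", "colour"]

# all combos of 2..len(cat_cols) columns
combos = chain.from_iterable(
    combinations(cat_cols, r) for r in range(2, len(cat_cols) + 1)
)
for combo in combos:
    # str.cat accepts a list of Series, so this handles triples and beyond too
    df["_".join(combo)] = df[combo[0]].astype(str).str.cat(
        [df[c].astype(str) for c in combo[1:]], sep="_"
    )
```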
But honestly, there is no need for this. CatBoost, for example, can do this on the fly (combine and encode categorical variables), but it has a number of parameters to control feature depth and memory usage, and by default it does not go deep. If you increase max_ctr_complexity, you will almost certainly get a combinatorial explosion and, as a result, OOM.
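As a rough sketch of what controlling that looks like (max_ctr_complexity is a real CatBoost parameter; the column names here are invented):

```python
from catboost import CatBoostClassifier

# max_ctr_complexity caps how many categorical features CatBoost may
# combine on the fly; raising it quickly blows up memory
model = CatBoostClassifier(
    max_ctr_complexity=2,
    cat_features=["gender", "job"],
    verbose=False,
)
```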
Also, if a categorical variable has high cardinality, then you won't be able to save RAM at all. For example, if you have more than 256 unique values, then you won't be able to label encode them and downcast to int8 to save memory without losing information.
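A minimal sketch of that label-encode-and-downcast trick (the data is invented):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])  # a categorical column
codes = s.astype("category").cat.codes  # label encode
# pandas already picks the smallest signed int dtype that fits the
# number of categories, so low-cardinality columns come out as int8
# "for free"; high-cardinality columns force a wider dtype
print(codes.dtype)  # int8 here
```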
@Sandy4321 did you try CatBoost?
@solegalli it is better to do this explicitly, so that you know exactly what is happening. CatBoost has an option for combining, but then it becomes slow, and we also do not know what happens inside CatBoost.
Or do you mean CatBoost can do it and return the transformed data?
As far as I know, they do not return/share the transformed data.
@GLevV good attempt, but can you share the full code with the data used?
@Sandy4321 quick question on this discussion. In my experience, I have not seen that we combine all categories the way you mention here when we are going to use these models in an organisation to score real people. This is mostly because these combinations make the new variables a bit difficult to understand.
Why would you be interested in doing so? Could you mention a few examples of its applicability to real life situations?
For example, scikit-learn creates polynomial features for continuous variables; we need to do the same for categorical ones.
These combinations of categorical features can be interpreted. For example, you have two categorical variables: gender and job name. Their combination is easily interpretable: female_mle, male_devops, female_ceo and so on. It's like grouping the dataframe by category and then flattening it.
But yes, in practice there is no need for these shenanigans. The only use case is Kaggle, and as I said above, in the case of a competition it is done manually: inspecting every categorical feature, its cardinality, and the usefulness of its combos, and only after that writing the code that will transform your dataset. @Sandy4321 I gave you snippets of the code. You just import the mentioned functions from the Python itertools lib and run them on the chosen categorical features. Good luck.
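A tiny demo of that interpretation (the data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"gender": ["female", "male", "female"], "job": ["mle", "devops", "ceo"]}
)
df["gender_job"] = df["gender"].str.cat(df["job"], sep="_")
# -> female_mle, male_devops, female_ceo
```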
Full code needed. For example, you have a data frame, mydf, and you need to create a new data frame, newdf, with all possible combinations of the categorical features.
Any news?
Hi @Sandy4321
I agree with @GLevV that the use case for this type of variable combination is mostly data science competitions. And I am personally not aware of its use in organisations.
So, to prioritise this issue, we would need clear examples of situations, other than data competitions, where this type of variable combination would be used. For example, a finance use case, or an insurance use case, or any other use case you are working on, with a view to deploying the model. Could you expand on that?
Hi @thibaultbl, what are your views on this issue?
I agree with @GLevV; most machine learning models (or at least tree-based ones) do that as part of their inner algorithm.
But it is also empirically true that you can improve your metrics by using this kind of cross feature. Like you said, it is mostly useful in data science competitions. Nevertheless, I think it can be useful in an organisation: if you use this kind of automated feature generation together with good feature selection, you can discover some hidden relations that you didn't think about.
My suggestion would be to use something with more human thinking, to avoid computing all one-to-one cross features, as in the snippet below.
columns = [("col1", "col2"), ("col3", "col4")]  # 1 - tuples of columns to combine, chosen by hand
for a, b in columns:
    df.loc[:, a + "_" + b] = df[a].astype(str) + "_" + df[b].astype(str)  # rough example, just to get the idea
Thank you! @thibaultbl
It will look something like this:
from itertools import combinations

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureCombiner(BaseEstimator, TransformerMixin):
    def __init__(self, cols, level=2):
        self.cols = cols
        self.level = level  # there should be a limit (like 3)

    def fit(self, X, y=None):
        # need X only to check that the cols are available
        missing = set(self.cols) - set(X.columns)
        if missing:
            raise ValueError(f"columns not in X: {missing}")
        # or we could take the list of combos as an argument, maybe that would be better
        self.feature_comb_list = []
        for comb in range(self.level, 1, -1):
            self.feature_comb_list += list(combinations(self.cols, comb))
        return self

    def transform(self, X):
        df = X.copy()
        for combo in self.feature_comb_list:
            # works if the cols are strings, otherwise -> convert to str first
            df["_".join(combo)] = df[combo[0]].astype(str).str.cat(
                [df[c].astype(str) for c in combo[1:]], sep="_"
            )
        return df
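A hypothetical usage of the sketch above (the dataframe is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"gender": ["female", "male"], "job": ["ceo", "devops"], "city": ["paris", "lyon"]}
)
fc = FeatureCombiner(cols=["gender", "job", "city"], level=2)
out = fc.fit(df).transform(df)
# adds gender_job, gender_city, job_city
```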
But there would also need to be a lot of memory/dtype/availability checks. That's why it's easier to do it manually or use something like CatBoost. The number of features will grow exponentially (that's why we need a limit and memory checks), and many of them will be highly correlated.
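To put a number on that growth, a quick count (reusing the 11-feature example from this thread):

```python
from math import comb

n = 11  # number of categorical columns
for level in range(2, 6):
    total = sum(comb(n, r) for r in range(2, level + 1))
    print(level, total)  # level 2: 55, level 3: 220, level 4: 550, level 5: 1012
```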
This transformer from category encoders may do something of the sort: https://contrib.scikit-learn.org/category_encoders/catboost.html
You are mistaken, there is nothing even close at this link: https://contrib.scikit-learn.org/category_encoders/catboost.html
Did you mean PolynomialWrapper, as it is written there: "For polynomial target support, see PolynomialWrapper"?
Is this the transformer you are referring to: https://contrib.scikit-learn.org/category_encoders/polynomial.html?
Otherwise, could you please paste a link with a reference?
Also, please mind our code of conduct for communications through this channel: https://feature-engine.readthedocs.io/en/latest/code_of_conduct.html
It seems it is not for interactions, as stated in https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#ORTHOGONAL: "1.2.3 Contrast Coding. Contrast coding creates a new variable by assigning numeric weights (denoted here as w) to the levels of an ANOVA factor under the constraint that the sum of the weights equals 0", or "Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable".
You are welcome to find a true feature-interactions library...
Is your feature request related to a problem? Please describe.
If we have categorical features, how do we create new features from all combinatoric combinations of those features? In real life categorical features are NOT independent; many of them depend on each other.
Even scikit-learn cannot do this, but maybe you will?
Related to https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook/issues/1
Describe the solution you'd like
For example, a maximum number of combined features is given: 2, or 4, or 5.
For a pandas DF you can use concatenation: https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python
columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)
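Note that sum(axis=1) concatenates with no separator; if you want one, a common alternative (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"]})
df["no_sep"] = df[["a", "b"]].astype(str).sum(axis=1)              # x1, y2
df["with_sep"] = df[["a", "b"]].astype(str).agg("_".join, axis=1)  # x_1, y_2
```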
So, for three-feature combinations from 11 features, 3 nested loops are not good for this:
for i in range(11): for j in range(i+1, 11): for k in range(j+1, 11)
You need to get 165 new features from all combinations (not permutations), so you end up with many new features.
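Instead of the nested loops, itertools does this directly (the column list is hypothetical):

```python
from itertools import combinations

cols = [f"cat_{i}" for i in range(11)]  # 11 categorical columns
triples = list(combinations(cols, 3))
print(len(triples))  # 165 combinations, no nested loops needed
```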
" Another alternative that I've seen from some Kaggle masters is to join the categories in 2 different variables, into a new categorical variable, so for example, if you have the variable gender, with the values female and male, for observations 1 and 2, and the variable colour with the value blue and green for observations 1 and 2 respectively, you could create a 3rd categorical variable called gender-colour, with the values female-blue for observation 1 and male-green for observation 2. Then you would have to apply the encoding methods from section 3 to this new variable ."
Yes, do this, but it should not be necessary to use pandas. Also, you need to think about RAM use, since there will be a lot of new features. Before creating the new features, think about converting the categorical features to "int" types with a small number of bits from numpy.