feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

feat: Kmeans Encode Categorical Feature #291

Open TremaMiguel opened 3 years ago

TremaMiguel commented 3 years ago

Is your feature request related to a problem? Please describe.

From Kaggle

The motivating idea for adding cluster labels is that the clusters will break up complicated relationships across features into simpler chunks. Our model can then just learn the simpler chunks one-by-one instead of having to learn the complicated whole all at once.

Describe the solution you'd like

A KmeansEncoder transformer that takes as input a set of variables X and returns the corresponding cluster label. A basic structure is shown below.

from sklearn.cluster import KMeans


class KmeansEncoder:

    def fit(self, X, y=None):
        # fit k-means on the input variables
        self.kmeans_ = KMeans(n_clusters=6)
        self.kmeans_.fit(X)
        return self

    def transform(self, X, y=None):
        # add the label of the closest cluster as a new feature
        X['Cluster'] = self.kmeans_.predict(X)
        return X
solegalli commented 3 years ago

By any chance, is there a link to an article? I've heard of this manipulation for categorical variables before, so I was wondering if we could gather a few more links to understand what else we should consider.

My first thoughts are:

TremaMiguel commented 3 years ago

Hi @solegalli, regarding your points:

1. The transformer could, by default, encode the categorical variables to an ordinal representation or any other transformation; here is an example from Kmodes. But I think it is better to warn the user to encode the categorical variables prior to calling this transformer; otherwise we should raise an error.

2. If there is a train/test skew we could also raise an error (perhaps by checking unique sample values). However, since the predict method simply assigns the closest cluster to an unknown new sample, it will not be able to detect the train/test skew by itself.

I've not found any valuable reference for this method. I think KBinsDiscretizer is similar, but in this case the output is only 1-dimensional (the cluster label). This might be a strong argument not to develop this feature, as KBinsDiscretizer already exists.
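
For what it's worth, a minimal sketch of the "encode first, then cluster" idea from point 1, using scikit-learn's OrdinalEncoder ahead of KMeans. The data and parameter values are illustrative only, not a proposed API.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OrdinalEncoder

# illustrative only: encode the categorical variables to integers first,
# then fit k-means on the encoded values
df = pd.DataFrame({"city": ["London", "Paris", "Paris", "Rome"],
                   "plan": ["basic", "premium", "basic", "premium"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(df[["city", "plan"]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(encoded)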

solegalli commented 3 years ago

I've not found any valuable reference for this method. I think KBinsDiscretizer is similar, but in this case the output is only 1-dimensional (the cluster label). This might be a strong argument not to develop this feature, as KBinsDiscretizer already exists.

I think the idea of this technique is to take a group of categorical variables and replace them with kmeans. Otherwise, it doesn't make much sense, because the clusters will be arbitrarily determined by the numbers used to replace the categories.

I think there is some value here. But I would like to gather some more references before we proceed.

TremaMiguel commented 3 years ago

The book Feature Engineering for Machine Learning describes the method as k-means featurization, and it already provides a code example.

import numpy as np
from sklearn.cluster import KMeans


class KMeansFeaturizer:
    """Transforms numeric data into k-means cluster memberships.

    This transformer runs k-means on the input data and converts each data point
    into the ID of the closest cluster. If a target variable is present, it is
    scaled and included as input to k-means in order to derive clusters that
    obey the classification boundary as well as group similar points together.
    """

    def __init__(self, k=100, target_scale=5.0, random_state=None):
        self.k = k
        self.target_scale = target_scale
        self.random_state = random_state

    def fit(self, X, y=None):
        """Runs k-means on the input data and finds centroids."""
        if y is None:
            # No target variable, just do plain k-means
            km_model = KMeans(n_clusters=self.k,
                              n_init=20,
                              random_state=self.random_state)
            km_model.fit(X)

            self.km_model = km_model
            self.cluster_centers_ = km_model.cluster_centers_
            return self

        # There is target information. Apply appropriate scaling and include
        # it in the input data to k-means.
        data_with_target = np.hstack((X, y[:, np.newaxis] * self.target_scale))

        # Build a pre-training k-means model on data and target
        km_model_pretrain = KMeans(n_clusters=self.k,
                                   n_init=20,
                                   random_state=self.random_state)
        km_model_pretrain.fit(data_with_target)

        # Run k-means a second time to get the clusters in the original space
        # without target info. Initialize using centroids found in pre-training.
        # Go through a single iteration of cluster assignment and centroid
        # recomputation.
        # (The book's example uses 2 input features, so this slice drops the
        # appended target column from the pre-trained centroids.)
        km_model = KMeans(n_clusters=self.k,
                          init=km_model_pretrain.cluster_centers_[:, :2],
                          n_init=1,
                          max_iter=1)
        km_model.fit(X)

        self.km_model = km_model
        self.cluster_centers_ = km_model.cluster_centers_
        return self

    def transform(self, X, y=None):
        """Outputs the closest cluster ID for each input data point."""
        clusters = self.km_model.predict(X)
        return clusters[:, np.newaxis]

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X, y)

As with #292, since there is a code example, it could be a good starting point.

solegalli commented 3 years ago

Thanks for adding in the reference to the book.

The author (of the book you linked in the previous comment) mentions that k-means featurization makes sense on variables where the Euclidean distance "makes sense", so it should not be applied to categorical variables (page 130 in my version).

If there are categorical variables, she suggests converting them to bin statistics, using the method she calls "bin counting" in her book. That could be one option. To do this, we would first have to create a BinCounting encoder for categorical variables (we would need to create a separate issue for this). Maybe that is a first step for this issue.

There are, however, other ways of applying k-means to categorical data; there is a summary in this article. There is quite a bit on StackExchange and the internet in general about how to apply k-means to categorical variables, but I did not have time to look in more detail.

We need to think about how best to design this transformer based on all this info: a clearer view of which encoding the transformer would offer and what the pros and limitations are. Maybe the transformer should offer different encoding possibilities.

Also, for this transformer, I am thinking of a new module: embedding, since in essence we will be replacing all the categorical variables with a completely new representation. I am not sure encoding is the right place, but this deserves more thought. What other transformers would go in this new module?

solegalli commented 3 years ago

Reading further into the book Feature Engineering for Machine Learning by Alice Zheng: bin counting is in essence target mean encoding, for example number of clicks / (clicks + non-clicks). So we could actually offer the user the possibility to choose which categorical encoding to use first, ahead of the k-means featurization.
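
For illustration, the bin counting / target mean idea boils down to something like the following (toy data, not from the book):

import pandas as pd

# replace each category by its observed click rate,
# i.e. clicks / (clicks + non-clicks) within the category
df = pd.DataFrame({"ad": ["a", "a", "b", "b", "b"],
                   "clicked": [1, 0, 1, 1, 0]})

click_rate = df.groupby("ad")["clicked"].mean()   # a: 0.5, b: 0.667
df["ad_encoded"] = df["ad"].map(click_rate)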

glevv commented 2 years ago

From the looks of it, it is the same as sklearn.preprocessing.KBinsDiscretizer with strategy='kmeans'

solegalli commented 2 years ago

Hi @GLevV

I didn't check the source code of the KBinsDiscretizer when the strategy is kmeans, but according to the docs: "‘kmeans’ strategy defines bins based on a k-means clustering procedure performed on each feature independently". Thus, it would map each variable to its own set of bins: one variable in, one discretized variable out.

The proposition here is to use all the categorical variables together to derive the clusters. So it would map from a group of variables to a single cluster label.

I don't think this is what the KBinsDiscretizer is doing. Am I correct?
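
A quick way to see the difference (an illustrative check, not from the thread): KBinsDiscretizer with strategy='kmeans' bins each feature on its own, so n input columns come out as n discretized columns rather than a single cluster label.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.1, 10.0],
              [0.2, 12.0],
              [5.0, 95.0],
              [5.2, 98.0]])

kbins = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
print(kbins.fit_transform(X))   # shape (4, 2): one binned column per feature
# [[0. 0.]
#  [0. 0.]
#  [1. 1.]
#  [1. 1.]]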

glevv commented 2 years ago

@solegalli sorry, my bad, I misunderstood the OP. It is a transformation of k features into 1, so it is just a regular fit_predict of KMeans on a subset of the data. Which is something like this: ColumnTransformer([('kmeans', KMeans(n_clusters=6), [ids of categorical columns])], remainder='passthrough')
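
Roughly, that sketch would look like the code below (columns and parameters are illustrative). One caveat: inside a ColumnTransformer, scikit-learn calls KMeans.transform, which returns distances to the cluster centers rather than a single label, so obtaining the cluster label itself would still need a small wrapper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer

# assume columns 0 and 1 are the (already numerically encoded) categorical columns
ct = ColumnTransformer(
    [("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0), [0, 1])],
    remainder="passthrough",
)

X = np.array([[0, 0, 1.5],
              [0, 1, 2.5],
              [3, 4, 0.5],
              [3, 5, 0.1]])

# output: 2 cluster-distance columns for columns 0-1, plus column 2 passed through
print(ct.fit_transform(X).shape)    # (4, 3)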

Morgan-Sell commented 1 year ago

@solegalli and @glevv,

Just doing some housekeeping. Given that sklearn offers this functionality through ColumnTransformer, does this mean we should deprioritize this task?

Do we have a "low priority" label on GitHub? Should we add the label to the appropriate tasks, e.g. this one?

solegalli commented 1 year ago

I am personally no great fan of the ColumnTransformer. It changes variable names and order.

We should create our own transformer that applies k-means to the selected categorical variables, after encoding them with one of the available encoders, and returns the clusters to replace the original variables.

So here we are replacing a group of categorical variables by a group of clusters.

The logic:

Step 1: encode the categorical variables using OHE, ordinal or target mean encoding, to be decided by the user via a parameter.

Step 2: run k-means (number of clusters to be decided by the user).

Step 3: remove the original variables and add the clusters instead, either as an ordinal variable or as OHE (each cluster is a variable with 1 or 0); the user decides what they want.
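
A rough sketch of that logic, assuming a pandas DataFrame input and an ordinal pre-encoding with ordinal output; the class and parameter names are illustrative, not the final API.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.preprocessing import OrdinalEncoder


class KMeansClusterEncoder(BaseEstimator, TransformerMixin):
    """Illustrative sketch: replaces a group of categorical variables
    with the k-means cluster label derived from them."""

    def __init__(self, variables, n_clusters=6, random_state=None):
        self.variables = variables
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # Step 1: encode the selected categorical variables (ordinal here;
        # OHE or target mean encoding could be offered via a parameter)
        self.encoder_ = OrdinalEncoder()
        encoded = self.encoder_.fit_transform(X[self.variables])
        # Step 2: fit k-means on the encoded group of variables
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(encoded)
        return self

    def transform(self, X):
        X = X.copy()
        encoded = self.encoder_.transform(X[self.variables])
        # Step 3: drop the original variables and add the cluster label
        X = X.drop(columns=self.variables)
        X["cluster"] = self.kmeans_.predict(encoded)
        return X

Usage would then be along the lines of KMeansClusterEncoder(variables=["city", "plan"], n_clusters=4).fit_transform(df), with an extra parameter later to choose ordinal versus OHE output for the clusters.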

Thank you!