TremaMiguel opened this issue 3 years ago
By any chance, is there a link to an article? I've heard of this manipulation for categorical variables before, so I was wondering if we could gather a few more links to understand what else we should consider.
My first thoughts are:
Hi @solegalli, regarding your points:
1. The transformer could, by default, encode categorical variables to an ordinal representation or apply any other transformation; here is an example from KModes. But I think it is better to warn the user to encode the categorical variables prior to calling this transformer; otherwise we should raise an error.
2. If there's a train/test skew we should raise this as an error too (perhaps by checking unique sample values). Since the predict method simply calculates the closest cluster for an unknown new sample, it will not be able to detect the train/test skew on its own.
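To illustrate that last point, a minimal sketch (made-up ordinal encodings) showing that `KMeans.predict` silently assigns any unseen value to the nearest centroid, so train/test skew passes undetected:

```python
import numpy as np
from sklearn.cluster import KMeans

# Train k-means on ordinally encoded categories 0, 1 and 2.
X_train = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# A value never seen during fit (e.g. a new category encoded as 7)
# is silently assigned to the nearest centroid; no warning is raised.
unseen = np.array([[7.0]])
label = km.predict(unseen)
print(label)  # a valid cluster id from the training clusters
```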
I've not found any valuable reference for this method. I think KBinsDiscretizer is similar, but in this case the output is only 1-dimensional (the cluster label). This might be a strong argument not to develop this feature, as KBinsDiscretizer already exists.
I think the idea of this technique is to take a group of categorical variables and replace them with kmeans. Otherwise, it doesn't make much sense, because the clusters will be arbitrarily determined by the numbers used to replace the categories.
I think there is some value here. But I would like to gather some more references before we proceed.
The book Feature Engineering for Machine Learning describes the method as k-means featurization, and it already provides a code example.
```python
import numpy as np
from sklearn.cluster import KMeans

class KMeansFeaturizer:
    """Transforms numeric data into k-means cluster memberships.

    This transformer runs k-means on the input data and converts each data
    point into the ID of the closest cluster. If a target variable is present,
    it is scaled and included as input to k-means in order to derive clusters
    that obey the classification boundary as well as group similar points
    together.
    """

    def __init__(self, k=100, target_scale=5.0, random_state=None):
        self.k = k
        self.target_scale = target_scale
        self.random_state = random_state

    def fit(self, X, y=None):
        """Runs k-means on the input data and finds centroids."""
        if y is None:
            # No target variable, just do plain k-means.
            km_model = KMeans(n_clusters=self.k,
                              n_init=20,
                              random_state=self.random_state)
            km_model.fit(X)

            self.km_model_ = km_model
            self.cluster_centers_ = km_model.cluster_centers_
            return self

        # There is target information. Apply appropriate scaling and include
        # it in the input data to k-means.
        data_with_target = np.hstack((X, y[:, np.newaxis] * self.target_scale))

        # Build a pre-training k-means model on data and target.
        km_model_pretrain = KMeans(n_clusters=self.k,
                                   n_init=20,
                                   random_state=self.random_state)
        km_model_pretrain.fit(data_with_target)

        # Run k-means a second time to get the clusters in the original space,
        # without target info. Initialize using the centroids found in
        # pre-training, dropping the target column, and go through a single
        # iteration of cluster assignment and centroid recomputation.
        km_model = KMeans(n_clusters=self.k,
                          init=km_model_pretrain.cluster_centers_[:, :-1],
                          n_init=1,
                          max_iter=1)
        km_model.fit(X)

        self.km_model_ = km_model
        self.cluster_centers_ = km_model.cluster_centers_
        return self

    def transform(self, X, y=None):
        """Outputs the closest cluster ID for each input data point."""
        clusters = self.km_model_.predict(X)
        return clusters[:, np.newaxis]

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X, y)
```
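The same trick can be sketched self-contained with plain scikit-learn calls on made-up data (the target is appended as a scaled extra column for pre-training, then k-means is refit in the original space for a single iteration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
target_scale = 5.0

# Pre-train k-means on the data with the scaled target appended.
pretrain = KMeans(n_clusters=10, n_init=20, random_state=0)
pretrain.fit(np.hstack((X, y[:, np.newaxis] * target_scale)))

# Refit in the original feature space, seeded with the pre-trained
# centroids (target column dropped), for a single iteration.
km = KMeans(n_clusters=10,
            init=pretrain.cluster_centers_[:, :-1],
            n_init=1,
            max_iter=1)
km.fit(X)

clusters = km.predict(X)[:, np.newaxis]
print(clusters.shape)  # (200, 1): one cluster id per sample
```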
Same as with #292: since there is a code example, it could be a good starting point.
Thanks for adding in the reference to the book.
The author (of the book you linked in the previous comment) mentions that k-means featurization makes sense on variables where the Euclidean distance "makes sense", so it should not be applied to categorical variables (page 130 in my version).
If there are categorical variables, she suggests converting them to binning statistics, using the method she calls "Bin Counting" in her book. That could be one option. To do this, we would first have to create a BinCounting encoder for categorical variables (we would need to create a separate issue for this). Maybe that is a first step for this issue.
There are, however, other ways of applying k-means to categorical data; there is a summary in this article. There is quite a bit on Stack Exchange and the internet in general about how to apply k-means to categorical variables, but I did not have time to look in more detail.
We need to think about how best to design this transformer based on all this info: a clearer view of which encodings the transformer would offer and what the pros and limitations are. Maybe the transformer should offer different encoding possibilities.
Also, for this transformer, I am thinking of a new module: embedding, since in essence, we will be replacing all categorical variables by a completely new representation. Not sure encoding is the right place. But this deserves more thought. What other transformers would go in this new module?
Reading further into the book Feature Engineering for Machine Learning by Alice Zheng: bin counting is in essence target mean encoding, as per the example, number of clicks / (clicks + non-clicks). So we could actually offer the user the possibility to choose the categorical encoding to use first, ahead of the k-means featurization.
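To make that equivalence concrete, a small sketch on made-up click data, showing that the bin-counting ratio per category is exactly the per-category target mean:

```python
import pandas as pd

# Made-up data: click (target) per ad category.
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "click":    [1,   0,   1,   0,   0],
})

# Bin counting: clicks / (clicks + non-clicks) per category,
# which is the same as the mean of the binary target per category.
rate = df.groupby("category")["click"].mean()
print(rate["a"])  # 2 clicks out of 3 impressions, i.e. ~0.667
print(rate["b"])  # 0 clicks out of 2 impressions, i.e. 0.0
```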
From the looks of it, it is the same as sklearn.preprocessing.KBinsDiscretizer with strategy='kmeans'.
Hi @GLevV
I didn't check the source code of the KBinsDiscretizer when strategy is kmeans, but according to the docs: "'kmeans' strategy defines bins based on a k-means clustering procedure performed on each feature independently". Thus, it would map k features to k discretized features.
The proposition here is to use all categorical variables together to derive the clusters. So it would map k features to a single cluster label.
I don't think this is what the KBinsDiscretizer is doing. Am I correct?
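A quick check of that reading on made-up data: KBinsDiscretizer with strategy='kmeans' bins each column on its own, so k input columns stay k output columns, whereas the proposal here would collapse them into one cluster-label column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two made-up numeric columns.
X = np.array([[0.0, 10.0],
              [0.1,  9.5],
              [5.0,  0.0],
              [5.2,  0.3]])

disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
Xt = disc.fit_transform(X)

# Each of the 2 input columns is discretized independently:
print(Xt.shape)  # (4, 2), not (4, 1)
```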
@solegalli sorry, my bad, I misunderstood the OP. It is a transformation of k features to 1, so it's just a regular fit_predict of KMeans on a subset of the data. Which is something like this:
ColumnTransformer([('kmeans', KMeans(n_clusters=6), [ids of categorical columns])], remainder='passthrough')
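One caveat with that one-liner, sketched below on made-up data: inside a ColumnTransformer, KMeans contributes its transform output, which is the matrix of distances to the n_clusters centroids, not a single label column; getting labels out would need a wrapper around predict:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer

# Made-up data: columns 0-1 stand in for (already encoded)
# categorical variables; column 2 is passed through untouched.
X = np.random.RandomState(0).rand(20, 3)

ct = ColumnTransformer(
    [("kmeans", KMeans(n_clusters=6, n_init=10, random_state=0), [0, 1])],
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# KMeans.transform yields 6 distance columns, plus 1 passthrough column.
print(Xt.shape)  # (20, 7)
```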
@solegalli and @glevv,
Just doing some housekeeping. Given that sklearn has this functionality via ColumnTransformer, does this mean we should deprioritize this task?
Do we have a "low priority" label on GitHub? Should we add that label to the appropriate tasks, e.g. this one?
I am personally no great fan of the ColumnTransformer. It changes variable names and order.
We should create our own transformer that applies kmeans to the selected categorical variables, after using one of the available encoders, to return the clusters to replace the original variables.
So here we are replacing a group of categorical variables by a group of clusters.
The logic:
Step 1: encode the categorical variables using OHE, ordinal, or target mean encoding, to be decided by the user via a parameter.
Step 2: run k-means (number of clusters to be decided by the user).
Step 3: remove the original variables and add the clusters instead, either as an ordinal variable or as OHE (each cluster is a variable with 1 or 0); the user decides what they want.
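The three steps above could be sketched roughly as follows on made-up data (names and parameters are hypothetical, not an agreed API; ordinal encoding and an ordinal cluster output are used for brevity):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OrdinalEncoder

# Made-up data with two categorical variables and one numeric one.
df = pd.DataFrame({
    "colour": ["red", "blue", "blue", "green", "red", "green"],
    "size":   ["S",   "M",    "L",    "S",     "L",   "M"],
    "price":  [1.0,   2.0,    3.0,    4.0,     5.0,   6.0],
})
cat_vars = ["colour", "size"]

# Step 1: encode the categorical variables (ordinal, for brevity).
encoded = OrdinalEncoder().fit_transform(df[cat_vars])

# Step 2: run k-means on the encoded variables.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(encoded)

# Step 3: drop the originals and add the cluster label instead.
out = df.drop(columns=cat_vars).assign(cluster=clusters)
print(out.columns.tolist())  # ['price', 'cluster']
```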
Thank you!
Is your feature request related to a problem? Please describe.
From Kaggle
Describe the solution you'd like
A KmeansEncoder transformer that takes as input a set of variables X and returns the corresponding cluster label. A basic structure is shown below.