Closed ETA444 closed 6 months ago
Implementation Summary:
The uniform_smart method within transform_cat() offers advanced cleaning of categorical data by leveraging textual similarity measures and hierarchical clustering to group and normalize similar categories. This is particularly useful for datasets with slight variations in categorical data entries.
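To make "textual similarity" concrete, here is a minimal pure-Python sketch of the Levenshtein (edit) distance. The function name `levenshtein` is illustrative and only stands in for the `lev.distance` call used in the implementation; it is not the library's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b` (classic dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Two swapped letters count as two substitutions.
print(levenshtein("new york", "new yrok"))  # 2
```

Small distances relative to the length of the strings suggest that two entries are variants of the same category.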
Purpose:
This method standardizes, cleans, and groups categorical data, improving consistency and preparing the data for further analysis or modeling.
Code Breakdown:
Method Header:
Checks whether the 'uniform_smart' method was requested and prints context for the user.
    if method.lower() == 'uniform_smart':
        print(f"< UNIFORM SMART TRANSFORMATION* >")
        print(f" This method leverages advanced data cleaning techniques for categorical variables, enhancing uniformity across your dataset:")
        print(f" ✔ Utilizes the `uniform_simple` method for initial preprocessing steps.")
        print(f" ✔ Employs Levenshtein distance to evaluate textual similarity among categories.")
        print(f" ✔ Applies hierarchical clustering to group similar categories together.")
        print(f" ✔ Selects the most representative category within each cluster to ensure data consistency.")
        print(f" ✔ Fills missing values with a placeholder to maintain data integrity. (default 'Unknown', customize with na_placeholder = '...')\n")
Initialize DataFrame and Uniform Columns:
        transformed_df = df.copy()  # work on a copy so the caller's dataframe is untouched
        uniform_columns = pd.DataFrame()  # collects only the transformed columns
Preprocess and Cluster:
        for variable in categorical_variables:
            # Fill missing values first: astype(str) can turn NaN into the literal string 'nan'
            categories = (
                transformed_df[variable]
                .fillna(na_placeholder)
                .astype(str)
                .str.lower()
                .str.strip()
                .str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
            )
            unique_categories = categories.unique()
            # Pairwise Levenshtein distances between the unique categories
            dist_matrix = np.array([[lev.distance(w1, w2) for w1 in unique_categories] for w2 in unique_categories])
            max_dist = np.max(dist_matrix)
            sim_matrix = 1 - dist_matrix / max_dist if max_dist != 0 else np.zeros_like(dist_matrix)
            # Note: linkage() treats a square 2-D array as observation vectors (one row per category)
            Z = linkage(sim_matrix, method='complete')
            cluster_labels = fcluster(Z, t=0.5, criterion='distance')
            # Pick the most frequent category in each cluster as its representative
            representatives = {}
            for cluster in np.unique(cluster_labels):
                index = np.where(cluster_labels == cluster)[0]
                cluster_categories = unique_categories[index]
                representative = max(cluster_categories, key=lambda x: (categories == x).sum())
                representatives[cluster] = representative
            category_to_representative = {cat: representatives[cluster_labels[i]] for i, cat in enumerate(unique_categories)}
            transformed_df[variable] = categories.map(category_to_representative)
            uniform_columns = pd.concat([uniform_columns, transformed_df[[variable]]], axis=1)
            print(f"\n['{variable}'] Category Transformation\n")
            print(f"Categories BEFORE transformation ({len(df[variable].unique())}): {df[variable].unique()}\n")
            print(f"Categories AFTER transformation ({len(transformed_df[variable].unique())}): {transformed_df[variable].unique()}\n")
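One caveat about the clustering step: `scipy.cluster.hierarchy.linkage` interprets a square 2-D array as a set of observation vectors, not as a precomputed distance matrix. A common alternative is to pass the condensed form of a normalized distance matrix, obtained with `scipy.spatial.distance.squareform`. The sketch below shows that variant under stated assumptions: the distance matrix is hand-written rather than computed with `lev.distance`, and the helper name `cluster_from_distances` is hypothetical. It is not the implementation in transform_cat().

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_from_distances(dist_matrix, threshold=0.5):
    """Complete-linkage clustering on a precomputed, symmetric distance matrix.

    Distances are scaled to [0, 1] by the largest entry, then the dendrogram
    is cut at `threshold`. Returns one integer label per row.
    """
    dist = np.asarray(dist_matrix, dtype=float)
    n = dist.shape[0]
    if n < 2 or dist.max() == 0:
        # Nothing to merge or split: every item falls into a single cluster.
        return np.ones(n, dtype=int)
    normalized = dist / dist.max()
    condensed = squareform(normalized)  # upper triangle as a 1-D vector
    Z = linkage(condensed, method="complete")
    return fcluster(Z, t=threshold, criterion="distance")


# Hand-written distances for four hypothetical categories: the first three
# are near-duplicates of each other, the fourth is unrelated.
dist_matrix = np.array([
    [0, 2, 1, 5],
    [2, 0, 2, 5],
    [1, 2, 0, 5],
    [5, 5, 5, 0],
])
labels = cluster_from_distances(dist_matrix)
print(labels)
```

With these distances the first three categories share one label and the fourth receives its own. The early return also covers a column with a single unique category, a case in which `linkage` has nothing to cluster.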
Output Results and Sanity Check:
        print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
        print(f"✔ Dataframe with only the uniform columns:\n{uniform_columns.head()}\n")
        print("☻ HOW TO: To catch the df's use - `transformed_df, uniform_columns = transform_cat(your_df, your_columns, method='uniform_smart')`.\n")
        print("< SANITY CHECK >")
        print(f" ➡ Shape of original df: {df.shape}")
        print(f" ➡ Shape of transformed df: {transformed_df.shape}\n")
        print("* Consider `uniform_simple` for basic data cleaning needs or when processing large datasets where computational efficiency is a concern.")
        return transformed_df, uniform_columns
See the Full Function:
You can refer to the complete implementation of the transform_cat() function, including the 'uniform_smart' method, on GitHub: transform_cat().
Description:
Method Functionality Idea:
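The uniform_smart method is designed to preprocess categorical variables using a combination of techniques: the uniform_simple method for initial preprocessing steps, Levenshtein distance to evaluate textual similarity among categories, hierarchical clustering to group similar categories together, and selection of the most representative category within each cluster.
How it operates: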
The method iterates through each categorical variable in the dataframe, preprocesses the categories, calculates similarity matrices, performs hierarchical clustering, and maps categories to their cluster representatives. The resulting transformed dataframe and a dataframe containing only the uniform columns are returned.
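The final mapping step can be sketched without any third-party libraries. This is a minimal illustration, assuming the cluster labels have already been computed; the names `map_to_representatives`, `values`, and `cluster_of` are hypothetical and do not appear in transform_cat().

```python
from collections import Counter


def map_to_representatives(values, cluster_of):
    """Replace each value with the most frequent value in its cluster.

    `values` is the cleaned column as a list of strings; `cluster_of` maps
    every distinct value to a cluster label.
    """
    counts = Counter(values)
    representative = {}
    for category, label in cluster_of.items():
        best = representative.get(label)
        # Keep the category seen most often; on a tie the first one wins.
        if best is None or counts[category] > counts[best]:
            representative[label] = category
    return [representative[cluster_of[value]] for value in values]


values = ["new york", "new york", "new yrok", "boston", "bostn", "boston"]
cluster_of = {"new york": 1, "new yrok": 1, "boston": 2, "bostn": 2}
print(map_to_representatives(values, cluster_of))
# ['new york', 'new york', 'new york', 'boston', 'boston', 'boston']
```

Choosing the most frequent spelling as the representative means rare typos are folded into the dominant form rather than the other way around.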
Usage:
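To use the uniform_smart method, call transform_cat() with method='uniform_smart' and catch both returned dataframes:
    transformed_df, uniform_columns = transform_cat(your_df, your_columns, method='uniform_smart')
Sanity Check: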
A sanity check is performed to compare the shape of the original dataframe with the transformed dataframe.