ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_cat() method: 'uniform_smart' #4

Status: Closed (ETA444 closed this issue 6 months ago)

ETA444 commented 8 months ago

Description:


Method Functionality Idea:

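The uniform_smart method preprocesses categorical variables by combining several techniques: initial cleaning with the uniform_simple steps, Levenshtein distance to measure textual similarity between categories, hierarchical clustering to group similar categories, and selection of a representative category for each cluster. Missing values are filled with a placeholder.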

How it operates:

The method iterates over each of the given categorical variables, normalizes its categories, computes a Levenshtein-based similarity matrix, performs hierarchical clustering on it, and maps every category to its cluster's representative. It returns the transformed dataframe along with a dataframe containing only the uniform columns.
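
The first of those steps, normalizing the raw labels, can be sketched without pandas as follows. This is a minimal illustration rather than the library's code: normalize_category is a hypothetical helper name, and the 'Unknown' default is assumed from the na_placeholder default mentioned later in this issue.

```python
import math
import re


def normalize_category(value, na_placeholder="Unknown"):
    """Clean one raw label: fill missing, lowercase, trim, drop special characters."""
    # Treat None and float NaN as missing BEFORE converting to text;
    # str(float("nan")) would otherwise yield the literal string "nan".
    if value is None or (isinstance(value, float) and math.isnan(value)):
        value = na_placeholder
    text = str(value).lower().strip()
    # Keep only letters, digits and whitespace.
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


raw = ["  New York", "new-york!", "NEW YORK ", None, float("nan")]
cleaned = [normalize_category(v) for v in raw]
print(cleaned)
```

Note that cleaning alone does not merge "new york" and "newyork"; that is the job of the similarity and clustering steps.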

Usage:

To use the uniform_smart method:

transformed_df, uniform_columns = transform_cat(your_df, your_columns, method='uniform_smart')

Sanity Check:

As a sanity check, the method prints the shape of the original dataframe next to the shape of the transformed dataframe so the two can be compared.

ETA444 commented 6 months ago

Implementation Summary:

The uniform_smart method within transform_cat() cleans categorical data by using a textual similarity measure (Levenshtein distance) and hierarchical clustering to group similar categories and replace them with a single normalized value. This is particularly useful for datasets whose categorical entries contain slight variations of the same label.
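
To make the similarity measure concrete, here is a small pure-Python sketch of Levenshtein distance and of the max-scaled similarity that appears in the code breakdown. The names levenshtein and similarity_matrix are illustrative only; the implementation in this issue calls lev.distance from a third-party package and builds the matrix with NumPy.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity_matrix(words):
    """Scale pairwise distances by the largest one and flip them into similarities."""
    dist = [[levenshtein(w1, w2) for w1 in words] for w2 in words]
    max_dist = max(max(row) for row in dist)
    if max_dist == 0:
        # Mirrors the fallback in this issue's code: all zeros when nothing differs.
        return [[0.0] * len(words) for _ in words]
    return [[1 - d / max_dist for d in row] for row in dist]


words = ["new york", "newyork", "boston"]
print(levenshtein("new york", "newyork"))
sim = similarity_matrix(words)
for row in sim:
    print([round(value, 2) for value in row])
```

Because the scaling uses the largest distance in the matrix, the similarity between two labels depends on which other labels are present in the same column.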

Purpose:

This method standardizes, cleans, and groups categorical data, improving its consistency ahead of further analysis or modeling.

Code Breakdown:

  1. Method Header:

    • Purpose: To clearly indicate the start of the 'uniform_smart' method implementation and provide context.
    if method.lower() == 'uniform_smart':
       # Plain strings: none of these messages interpolate values, so no f-prefix is needed
       print("< UNIFORM SMART TRANSFORMATION* >")
       print(" This method leverages advanced data cleaning techniques for categorical variables, enhancing uniformity across your dataset:")
       print("  ✔ Utilizes the `uniform_simple` method for initial preprocessing steps.")
       print("  ✔ Employs Levenshtein distance to evaluate textual similarity among categories.")
       print("  ✔ Applies hierarchical clustering to group similar categories together.")
       print("  ✔ Selects the most representative category within each cluster to ensure data consistency.")
       print("  ✔ Fills missing values with a placeholder to maintain data integrity. (default 'Unknown', customize with na_placeholder = '...')\n")
  2. Initialize DataFrame and Uniform Columns:

    • Purpose: To prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    uniform_columns = pd.DataFrame()
  3. Preprocess and Cluster:

    • Purpose: To preprocess, cluster, and select representative categories.
    for variable in categorical_variables:
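       categories = (
           transformed_df[variable]
           .fillna(na_placeholder)  # fill first: astype(str) would turn NaN into the string 'nan'
           .astype(str)
           .str.lower()
           .str.strip()
           .str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)  # raw string avoids an invalid-escape warning
       )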
       unique_categories = categories.unique()
    
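       # Pairwise Levenshtein distances, rescaled to similarities in [0, 1]
       dist_matrix = np.array([[lev.distance(w1, w2) for w1 in unique_categories] for w2 in unique_categories])
       max_dist = np.max(dist_matrix)
       sim_matrix = 1 - dist_matrix / max_dist if max_dist != 0 else np.zeros_like(dist_matrix)
    
       if len(unique_categories) > 1:
           # linkage() treats each row of the square similarity matrix as one observation
           Z = linkage(sim_matrix, method='complete')
           cluster_labels = fcluster(Z, t=0.5, criterion='distance')
       else:
           # linkage() needs at least two observations; a single category forms its own cluster
           cluster_labels = np.array([1])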
    
       representatives = {}
       for cluster in np.unique(cluster_labels):
           index = np.where(cluster_labels == cluster)[0]
           cluster_categories = unique_categories[index]
           representative = max(cluster_categories, key=lambda x: (categories == x).sum())
           representatives[cluster] = representative
    
       category_to_representative = {cat: representatives[cluster_labels[i]] for i, cat in enumerate(unique_categories)}
       transformed_df[variable] = categories.map(category_to_representative)
       uniform_columns = pd.concat([uniform_columns, transformed_df[[variable]]], axis=1)
    
       print(f"\n['{variable}'] Category Transformation\n")
       print(f"Categories BEFORE transformation ({len(df[variable].unique())}): {df[variable].unique()}\n")
       print(f"Categories AFTER transformation ({len(transformed_df[variable].unique())}): {transformed_df[variable].unique()}\n")
  4. Output Results and Sanity Check:

    • Purpose: To display the results and perform a sanity check.
    print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the uniform columns:\n{uniform_columns.head()}\n")
    print("☻ HOW TO: To catch the df's use - `transformed_df, uniform_columns = transform_cat(your_df, your_columns, method='uniform_smart')`.\n")
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original df: {df.shape}")
    print(f"  ➡ Shape of transformed df: {transformed_df.shape}\n")
    print("* Consider `uniform_simple` for basic data cleaning needs or when processing large datasets where computational efficiency is a concern.")
    
    return transformed_df, uniform_columns
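
The clustering and representative selection in step 3 can be illustrated with a dependency-free sketch. It is a simplification under stated assumptions, not the implementation above: it runs a hand-written complete-linkage procedure directly on a distance matrix, whereas the issue's code passes the similarity matrix to SciPy's linkage and fcluster. The function names, the distance values, and the threshold of 2 are all made up for illustration.

```python
from collections import Counter


def complete_linkage_clusters(dist, threshold):
    """Merge the closest clusters until the next merge would exceed threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between clusters is the largest
                # distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_representatives(values, categories, clusters):
    """Map every category to the most frequent category in its cluster."""
    counts = Counter(values)
    mapping = {}
    for members in clusters:
        names = [categories[i] for i in members]
        representative = max(names, key=lambda name: counts[name])
        for name in names:
            mapping[name] = representative
    return mapping


values = ["new york", "new york", "newyork", "boston", "bostn", "boston"]
categories = ["new york", "newyork", "boston", "bostn"]
# Illustrative, hand-written distances between the four categories (assumption).
dist = [
    [0, 1, 7, 7],
    [1, 0, 6, 7],
    [7, 6, 0, 1],
    [7, 7, 1, 0],
]

clusters = complete_linkage_clusters(dist, threshold=2)
mapping = pick_representatives(values, categories, clusters)
transformed = [mapping[v] for v in values]
print(clusters)
print(transformed)
```

Choosing the most frequent member as the representative means rare misspellings are folded into the dominant spelling rather than the other way around.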

See the Full Function:

You can refer to the complete implementation of the transform_cat() function, including the 'uniform_smart' method, on GitHub: transform_cat().